Principal Component Analysis (PCA) is an important tool of 
dimension reduction especially when the dimension (or the number 
of variables) is very high. Asymptotic studies where the sample size 
is fixed, and the dimension grows [i.e., High Dimension, Low Sam- 
ple Size (HDLSS)] are becoming increasingly relevant. We investigate 
the asymptotic behavior of the Principal Component (PC) directions. 
HDLSS asymptotics are used to study consistency, strong inconsis- 
tency and subspace consistency. We show that if the first few eigen- 
values of a population covariance matrix are large enough compared 
to the others, then the corresponding estimated PC directions are 
consistent or converge to the appropriate subspace (subspace consis- 
tency) and most other PC directions are strongly inconsistent. Broad 
sets of sufficient conditions for each of these cases are specified and the 
main theorem gives a catalogue of possible combinations. In preparation
for these results, we show that the geometric representation
of HDLSS data holds under general conditions, which include a
$\rho$-mixing condition and a broad range of sphericity measures of the
covariance matrix.

1. Introduction and summary. The High Dimension, Low Sample Size 
(HDLSS) data situation occurs in many areas of modern science and the 
asymptotic studies of this type of data are becoming increasingly relevant. 
We will focus on the case that the dimension d increases while the sample size 
n is fixed as done in Hall, Marron and Neeman [8] and Ahn et al. [1]. The
d-dimensional covariance matrix is challenging to analyze, in general, since
the number of parameters is $d(d+1)/2$, which increases even faster than d.
Instead of assessing all of the parameter estimates, the covariance matrix is
usually analyzed by Principal Component Analysis (PCA). PCA is often used to
visualize important structure in the data, as shown in Figure 1. The data in
Figure 1, described in detail in Bhattacharjee et al. [4] and Liu et al. [15],
are from a microarray study of lung cancer. Different symbols correspond
to cancer subtypes, and Figure 1 shows the projections of the data onto the
subspaces generated by the PC1 and PC2 directions (left panel) and the PC1
and PC3 directions (center panel). This shows that the difference between
subtypes is so strong that it drives the first three principal components.
This illustrates a common occurrence: the data have an important underlying
structure which is revealed by the first few PC directions.

Fig. 1. Scatterplots of data projected on the first three PC directions. The
dataset contains 56 patients with 2530 genes. There are 20 Pulmonary Carcinoid
(plotted as +), 13 Colon Cancer Metastases (*), 17 Normal Lung (o), and 6 Small
Cell Carcinoma (x). In spite of the high dimensionality, PCA reveals important
structure in the data. This corresponds to the consistent case in our
asymptotics, as shown in the scree plot on the right. Note that the first few
eigenvalues are much larger than the rest.

PCA is also used to reduce dimensionality by approximating the data
with the first few principal components.

For both visualization and data reduction, it is critical that the PCA
empirical eigenvectors reflect true underlying distributional structure. Hence,
our focus is on the underlying mechanism which determines when the sample
PC directions converge to their population counterparts as $d \to \infty$. In
general, we assume $d > n$. Since the size of the covariance matrix depends
on d, the population covariance matrix is denoted as $\Sigma_d$ and similarly the
sample covariance matrix, $S_d$, so that their dependency on the dimension
is emphasized. PCA is done by eigen decomposition of a covariance matrix.
The eigen decomposition of $\Sigma_d$ is

$\Sigma_d = U_d \Lambda_d U_d'$,

where $\Lambda_d$ is a diagonal matrix of eigenvalues
$\lambda_{1,d} \ge \lambda_{2,d} \ge \cdots \ge \lambda_{d,d}$ and $U_d$
is a matrix of corresponding eigenvectors so that
$U_d = [u_{1,d}, u_{2,d}, \ldots, u_{d,d}]$.
$S_d$ is similarly decomposed as

$S_d = \hat{U}_d \hat{\Lambda}_d \hat{U}_d'$.
Ahn et al. [1] developed the concept of HDLSS consistency, which was
the first investigation of when PCA could be expected to find important
structure in HDLSS data. Our main results are formulated in terms of three
related concepts:

1. Consistency: The direction $\hat{u}_{i,d}$ is consistent with its population
counterpart $u_{i,d}$ if $\mathrm{Angle}(\hat{u}_{i,d}, u_{i,d}) \to 0$ as $d \to \infty$.
The growth of dimension can be understood as adding more variation. The
consistency of sample eigenvectors occurs when the added variation supports
the existing structure in the covariance or is small enough to be ignored.

2. Strong inconsistency: In situations where $\hat{u}_{i,d}$ is not consistent, a
perhaps counter-intuitive HDLSS phenomenon frequently occurs. In particular,
$\hat{u}_{i,d}$ is said to be strongly inconsistent with its population counterpart
$u_{i,d}$ in the sense that it tends to be as far away from $u_{i,d}$ as possible,
that is, $\mathrm{Angle}(\hat{u}_{i,d}, u_{i,d}) \to \pi/2$ as $d \to \infty$. Strong
inconsistency occurs when the added variation obscures the underlying
structure of the population covariance matrix.

3. Subspace consistency: When several population eigenvalues indexed by
$j \in J$ are similar, the corresponding sample eigenvectors may not be
distinguishable. In this case, $\hat{u}_{j,d}$ will not be consistent for $u_{j,d}$ but
will tend to lie in the linear span, $\mathrm{span}\{u_{j,d} : j \in J\}$. This motivates
the definition of convergence of a direction $\hat{u}_{i,d}$ to a subspace, called
subspace consistency:

$\mathrm{Angle}(\hat{u}_{i,d}, \mathrm{span}\{u_{j,d} : j \in J\}) \to 0$

as $d \to \infty$. This definition essentially comes from the theory of canonical
angles discussed by Gaydos [7]. That theory also gives a notion of
convergence of subspaces, which could be developed here.
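
The two angles used in these definitions can be computed directly. The following is a minimal sketch, not taken from the paper: the function names `angle` and `angle_to_span` are illustrative, the absolute value reflects the assumption that an eigenvector is only defined up to sign, and the spanning vectors are assumed to be orthonormal (as population eigenvectors are).

```python
import math

def angle(u, v):
    """Angle in [0, pi/2] between the lines spanned by u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # abs(): a direction vector and its negative describe the same direction.
    cosine = min(1.0, abs(dot) / (norm_u * norm_v))
    return math.acos(cosine)

def angle_to_span(u, orthonormal_basis):
    """Angle between u and the span of an orthonormal set of vectors."""
    norm_u_sq = sum(a * a for a in u)
    # Squared length of the orthogonal projection of u onto the span.
    proj_sq = sum(sum(a * b for a, b in zip(u, e)) ** 2
                  for e in orthonormal_basis)
    cosine = min(1.0, math.sqrt(proj_sq / norm_u_sq))
    return math.acos(cosine)
```

In this notation, consistency corresponds to `angle` tending to 0, strong inconsistency to `angle` tending to `math.pi / 2`, and subspace consistency to `angle_to_span` tending to 0.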

In recent years, substantial work has been done on the asymptotic behavior
of eigenvalues of the sample covariance matrix in the limit as $d \to \infty$; see
Baik, Ben Arous and Péché [2], Johnstone [11] and Paul [16] for Gaussian
assumptions and Baik and Silverstein [3] for non-Gaussian results when d
and n increase at the same rate, that is, $d/n \to c > 0$. Many of these focus
on the spiked covariance model, introduced by Johnstone [11]. The spiked
covariance model assumes that the first few eigenvalues of the population
covariance matrix are greater than 1 and the rest are set to be 1 for all d.
HDLSS asymptotics, where only $d \to \infty$ while n is fixed, have been studied
by Hall, Marron and Neeman [8] and Ahn et al. [1]. They explored conditions
which give the geometric representation of HDLSS data (i.e., modulo
rotation, data tend to lie at vertices of a regular simplex) as well as strong
inconsistency of eigenvectors. Strong inconsistency is also found in the
context of $d/n \to c$, in the study of phase transition; see, for example,
Paul [16], Johnstone and Lu [12] and Baik, Ben Arous and Péché [2].

A reviewer pointed out a useful framework for organizing these variations:

1. Classical: $d(n)/n \to 0$, as $n \to \infty$.

2. Random matrices: $d(n)/n \to c$, as $n \to \infty$.

3. HDLSS: n fixed, with $d \to \infty$.

We view all of these as informative. Which is most informative will depend on 
the particular data analytic setting, in the same way that either the Normal 
or Poisson approximation can be "most informative" about the Binomial 
distribution. 

In this paper, we focus only on the HDLSS case, and a broad and gen- 
eral set of conditions for consistency and strong inconsistency are provided. 
Section 2 develops conditions that guarantee the nonzero eigenvalues of the 
sample covariance matrix tend to an increasing constant, which are much 
more general than those of Hall, Marron and Neeman [8] and Ahn et al. 
[1]. This asymptotic behavior of the sample covariance matrix is the basis 
of the geometric representation of HDLSS data. Our result gives a broad 
new insight into this representation as discussed in Section 3. The central 
issue of consistency and strong inconsistency is developed in Section 4, as a 
series of theorems. For a fixed number k, we assume the first k eigenvalues 
are much larger than the others. We show that when k = 1, the first sample 
eigenvector is consistent and the others are strongly inconsistent. We also 
generalize to the k > 1 case, featuring two different types of results (consis- 
tency and subspace consistency) according to the asymptotic behaviors of 
the first k eigenvalues. All results are combined and generalized in the main 
theorem (Theorem 2). Proofs of theorems are given in Section 5. 

1.1. General setting. Suppose we have a $d \times n$ data matrix
$X_{(d)} = [X_{1,(d)}, \ldots, X_{n,(d)}]$ with $d > n$, where the d-dimensional random
vectors $X_{1,(d)}, \ldots, X_{n,(d)}$ are independent and identically distributed. We
assume that each $X_{i,(d)}$ follows a multivariate distribution (which does not
have to be Gaussian) with mean zero and covariance matrix $\Sigma_d$. Define the
sphered data matrix $Z_{(d)} = \Lambda_d^{-1/2} U_d' X_{(d)}$. Then the components of the
$d \times n$ matrix $Z_{(d)}$ have unit variances, and are uncorrelated with each other.
We shall regulate the dependency (recall that for non-Gaussian data,
uncorrelated variables can still be dependent) of the random variables in
$Z_{(d)}$ by a $\rho$-mixing condition. This allows serious weakening of the
assumptions of Gaussianity while still enabling the laws of large numbers
that lie behind the geometric representation results of Hall, Marron and
Neeman [8].

The concept of $\rho$-mixing was first developed by Kolmogorov and Rozanov
[14]. See Bradley [5] for a clear and insightful discussion. For
$-\infty \le J \le L \le \infty$, let $\mathcal{F}_J^L$ denote the $\sigma$-field of events generated
by the random variables $(Z_i, J \le i \le L)$. For any $\sigma$-field $\mathcal{A}$, let
$L_2(\mathcal{A})$ denote the space of square-integrable, $\mathcal{A}$-measurable (real-valued)
random variables. For each $m \ge 1$, define the maximal correlation coefficient

$\rho(m) := \sup |\mathrm{corr}(f, g)|$, $f \in L_2(\mathcal{F}_{-\infty}^{j})$, $g \in L_2(\mathcal{F}_{j+m}^{\infty})$,

where the sup is over all f, g and $j \in \mathbb{Z}$. The sequence $\{Z_i\}$ is said to be
$\rho$-mixing if $\rho(m) \to 0$ as $m \to \infty$.

While the concept of $\rho$-mixing is useful as a mild condition for the
development of laws of large numbers, its formulation is critically dependent
on the ordering of variables. For many interesting data types, such as
microarray data, there is clear dependence but no natural ordering of the
variables. Hence, we assume that there is some permutation of the data which
is $\rho$-mixing. In particular, let $\{Z_{ij,(d)}\}_{i=1}^{d}$ be the components of the
jth column vector of $Z_{(d)}$. We assume that for each d, there exists a
permutation $\pi_d : \{1, \ldots, d\} \to \{1, \ldots, d\}$ so that the sequence
$\{Z_{\pi_d(i)j,(d)} : i = 1, \ldots, d\}$ is $\rho$-mixing. This assumption makes the
results invariant under a permutation of the variables.

In the following, all the quantities depend on d, but the subscript d will be
omitted for the sake of simplicity when it does not cause any confusion. The
sample covariance matrix is defined as $S = n^{-1} X X'$. We do not subtract the
sample mean vector because the population mean is assumed to be 0. Since
the dimension of the sample covariance matrix S grows, it is challenging to
deal with S directly. A useful approach is to work with the dual of S. The
dual approach switches the role of columns and rows of the data matrix, by
replacing X by $X'$. The $n \times n$ dual sample covariance matrix is defined as
$S_D = n^{-1} X' X$. An advantage of this dual approach is that $S_D$ and S share
nonzero eigenvalues. If we write X as $U \Lambda^{1/2} Z$ and use the fact that U is a
unitary matrix,

(1.1) $n S_D = (Z' \Lambda^{1/2} U')(U \Lambda^{1/2} Z) = Z' \Lambda Z = \sum_{i=1}^{d} \lambda_{i,d} z_i' z_i$,

where the $z_i$'s, $i = 1, \ldots, d$, are the row vectors of the matrix Z. Note that
$n S_D$ is commonly referred to as the Gram matrix, consisting of inner
products between observations.
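
The claim that S and its dual share nonzero eigenvalues can be checked numerically without an eigen solver. The sketch below is an illustration under stated assumptions: it uses standard Gaussian entries, and it compares the traces of the first n matrix powers, which equal the power sums of the eigenvalues and therefore agree exactly when the nonzero spectra coincide. All function names are hypothetical.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def power_traces(M, kmax):
    """Return [tr(M), tr(M^2), ..., tr(M^kmax)]."""
    out, P = [], M
    for _ in range(kmax):
        out.append(trace(P))
        P = matmul(P, M)
    return out

def dual_check(d=40, n=3, seed=0):
    """Power-sum traces of S = XX'/n (d x d) and S_D = X'X/n (n x n)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(d)]
    Xt = transpose(X)
    S = [[v / n for v in row] for row in matmul(X, Xt)]
    S_D = [[v / n for v in row] for row in matmul(Xt, X)]
    return power_traces(S, n), power_traces(S_D, n)
```

The practical point is the size difference: S is d x d while S_D is only n x n, so for fixed n the dual stays small as d grows.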

2. HDLSS asymptotic behavior of the sample covariance matrix. In this
section, we investigate the behavior of the sample covariance matrix S when
$d \to \infty$ and n is fixed. Under mild and broad conditions, the eigenvalues of S,
or the dual $S_D$, behave asymptotically as if they are from the identity matrix.
That is, the set of sample eigenvectors tends to be an arbitrary choice. This
lies at the heart of the geometric representation results of Hall, Marron and
Neeman [8] and Ahn et al. [1] which are studied more deeply in Section 3.
We will see that this condition readily implies the strong inconsistency of
sample eigenvectors; see Theorem 2.

The conditions for the theorem are conveniently formulated in terms of a
measure of sphericity

$\varepsilon \equiv \frac{\mathrm{tr}^2(\Sigma)}{d\,\mathrm{tr}(\Sigma^2)} = \frac{(\sum_{i=1}^{d} \lambda_{i,d})^2}{d \sum_{i=1}^{d} \lambda_{i,d}^2}$,

proposed and used by John [9, 10] as the basis of a hypothesis test for
equality of eigenvalues. Note that these inequalities always hold:

$\frac{1}{d} \le \varepsilon \le 1$.

Also note that perfect sphericity of the distribution (i.e., equality of
eigenvalues) occurs only when $\varepsilon = 1$. The other end of the $\varepsilon$ range is the
most singular case, where in the limit the first eigenvalue dominates all
others.
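
As a small illustration of the sphericity measure, the sketch below evaluates $\varepsilon$ for a given list of eigenvalues. The function name `sphericity` and the optional argument `k` are assumptions made for this example; with `k > 1` only the eigenvalues with index at least k enter the sums, while the divisor remains the full dimension d.

```python
def sphericity(lams, k=1):
    """(sum_{i>=k} lam_i)^2 / (d * sum_{i>=k} lam_i^2), with 1-based index k."""
    d = len(lams)
    tail = lams[k - 1:]          # eigenvalues lam_k, ..., lam_d
    total = sum(tail)
    total_sq = sum(x * x for x in tail)
    return total * total / (d * total_sq)

if __name__ == "__main__":
    print(sphericity([2.0] * 5))            # equal eigenvalues: upper end
    print(sphericity([1.0, 0.0, 0.0, 0.0])) # one nonzero eigenvalue: lower end
```

The two printed cases sit at the two ends of the admissible range: equal eigenvalues give the value 1, and a single nonzero eigenvalue gives 1/d.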

Ahn et al. [1] claimed that if $\varepsilon \gg 1/d$, in the sense that
$\varepsilon^{-1} = o(d)$, then the eigenvalues of $S_D$ tend to be identical in
probability as $d \to \infty$. However, they needed an additional assumption (e.g.,
a Gaussian assumption on $X_{(d)}$) to have independence among components of
$Z_{(d)}$, as described in Example 3.1. In this paper, we extend this result to
the case of arbitrary distributions with dependency regulated by the
$\rho$-mixing condition as in Section 1.1, which is much more general than either
a Gaussian or an independence assumption. We also explore convergence in the
almost sure sense with stronger assumptions. Our results use a measure of
sphericity for part of the eigenvalues for conditions of a.s. convergence and
also for later use in Section 4. In particular, define the measure of
sphericity for $\{\lambda_{k,d}, \ldots, \lambda_{d,d}\}$ as

$\varepsilon_k \equiv \frac{(\sum_{i=k}^{d} \lambda_{i,d})^2}{d \sum_{i=k}^{d} \lambda_{i,d}^2}$.

For convenience, we name several assumptions about the measure of sphericity
$\varepsilon$ that are used in this paper:

• The $\varepsilon$-condition: $\varepsilon \gg 1/d$, that is,

(2.1) $(d\varepsilon)^{-1} = \frac{\sum_{i=1}^{d} \lambda_{i,d}^2}{(\sum_{i=1}^{d} \lambda_{i,d})^2} \to 0$ as $d \to \infty$.

• The $\varepsilon_k$-condition: $\varepsilon_k \gg 1/d$, that is,

(2.2) $(d\varepsilon_k)^{-1} = \frac{\sum_{i=k}^{d} \lambda_{i,d}^2}{(\sum_{i=k}^{d} \lambda_{i,d})^2} \to 0$ as $d \to \infty$.

• The strong $\varepsilon_k$-condition: For some fixed $l \ge k$, $\varepsilon_l \gg 1/\sqrt{d}$, that is,

(2.3) $d^{-1/2} \varepsilon_l^{-1} = \frac{d^{1/2} \sum_{i=l}^{d} \lambda_{i,d}^2}{(\sum_{i=l}^{d} \lambda_{i,d})^2} \to 0$ as $d \to \infty$.

Remark. Note that the $\varepsilon_k$-condition is identical to the $\varepsilon$-condition when
$k = 1$. Similarly, the strong $\varepsilon_k$-condition is also called the strong
$\varepsilon$-condition when $k = 1$. The strong $\varepsilon_k$-condition is stronger than the
$\varepsilon_k$-condition if the minimum of the l's which satisfy (2.3), $l_0$, is as small
as k. But, if $l_0 > k$, then this is not necessarily true. We will use the
strong $\varepsilon_k$-condition combined with the $\varepsilon_k$-condition.

Note that the $\varepsilon$-condition is quite broad in the spectrum of possible values
of $\varepsilon$: It only avoids the most singular case. The strong $\varepsilon$-condition further
restricts $\varepsilon$ to essentially the range $(1/\sqrt{d}, 1]$.

The following theorem states that if the (strong) $\varepsilon$-condition holds for
$\Sigma_d$, then the sample eigenvalues behave as if they are from a scaled identity
matrix. It uses the notation $I_n$ for the $n \times n$ identity matrix.

Theorem 1. For a fixed n, let $\Sigma_d = U_d \Lambda_d U_d'$, $d = n + 1, n + 2, \ldots$, be
a sequence of covariance matrices. Let $X_{(d)}$ be a $d \times n$ data matrix from a
d-variate distribution with mean zero and covariance matrix $\Sigma_d$. Let
$S_d = \hat{U}_d \hat{\Lambda}_d \hat{U}_d'$ be the sample covariance matrix estimated from $X_{(d)}$
for each d and let $S_{D,d}$ be its dual.

(1) Assume that the components of $Z_{(d)} = \Lambda_d^{-1/2} U_d' X_{(d)}$ have uniformly
bounded fourth moments and are $\rho$-mixing under some permutation. If (2.1)
holds, then

(2.4) $c_d^{-1} S_{D,d} \to I_n$,

in probability as $d \to \infty$, where $c_d = n^{-1} \sum_{i=1}^{d} \lambda_{i,d}$.

(2) Assume that the components of $Z_{(d)} = \Lambda_d^{-1/2} U_d' X_{(d)}$ have uniformly
bounded eighth moments and are independent of each other. If both (2.1)
and (2.3) hold, then $c_d^{-1} S_{D,d} \to I_n$ almost surely as $d \to \infty$.

The (strong) $\varepsilon$-condition holds for quite general settings. The strong
$\varepsilon$-condition combined with the $\varepsilon$-condition holds under:

(a) Null case: All eigenvalues are the same.

(b) Mild spiked model: The first m eigenvalues are moderately larger than
the others, for example, $\lambda_{1,d} = \cdots = \lambda_{m,d} = C_1 \cdot d^{\alpha}$ and
$\lambda_{m+1,d} = \cdots = \lambda_{d,d} = C_2$, where $m < d$, $\alpha < 1$ and $C_1, C_2 > 0$.

The $\varepsilon$-condition fails when:

(c) Singular case: Only the first few eigenvalues are nonzero.

(d) Exponential decrease: $\lambda_{i,d} = c^{-i}$ for some $c > 1$.

(e) Sharp spiked model: The first m eigenvalues are much larger than the
others. One example is the same as (b), but with $\alpha > 1$.

The polynomially decreasing case, $\lambda_{i,d} = i^{-\beta}$, is interesting because it
depends on the power $\beta$:

(f-1) The strong $\varepsilon$-condition holds when $0 \le \beta < 3/4$.

(f-2) The $\varepsilon$-condition holds, but the strong $\varepsilon$-condition fails when
$3/4 \le \beta \le 1$.

(f-3) The $\varepsilon$-condition fails when $\beta > 1$.

Another family of examples that includes all three cases is the spiked
model with the number of spikes increasing, for example,
$\lambda_{1,d} = \cdots = \lambda_{m,d} = C_1 \cdot d^{\alpha}$ and
$\lambda_{m+1,d} = \cdots = \lambda_{d,d} = C_2$, where $m = \lfloor d^{\beta} \rfloor$, $0 < \beta < 1$
and $C_1, C_2 > 0$:

(g-1) The strong $\varepsilon$-condition holds when $0 \le 2\alpha + \beta < 3/2$.

(g-2) The $\varepsilon$-condition holds but the strong $\varepsilon$-condition fails when
$3/2 \le 2\alpha + \beta < 2$.

(g-3) The $\varepsilon$-condition fails when $2\alpha + \beta \ge 2$.
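
The polynomially decreasing cases (f-1) to (f-3) can be explored numerically by evaluating the quantities in (2.1) and (2.3) for increasing d. The following sketch is only an illustration at finite d, not a proof of the limits; the function names are hypothetical and the strong condition is evaluated with $l = 1$ by default.

```python
import math

def eps_condition_ratio(lams, k=1):
    """sum_{i>=k} lam_i^2 / (sum_{i>=k} lam_i)^2, the quantity in (2.1)-(2.2)."""
    tail = lams[k - 1:]
    return sum(x * x for x in tail) / sum(tail) ** 2

def strong_condition_ratio(lams, l=1):
    """d^{1/2} times the ratio above with index l, the quantity in (2.3)."""
    return math.sqrt(len(lams)) * eps_condition_ratio(lams, l)

def polynomial_spectrum(d, beta):
    """Eigenvalues lam_i = i^(-beta), i = 1, ..., d."""
    return [i ** (-beta) for i in range(1, d + 1)]

if __name__ == "__main__":
    for beta in (0.25, 0.9, 2.0):
        for d in (100, 10000):
            lams = polynomial_spectrum(d, beta)
            print(beta, d,
                  eps_condition_ratio(lams),
                  strong_condition_ratio(lams))
```

For a power in range (f-1) both quantities shrink as d grows, in range (f-2) the first shrinks while the second grows, and for $\beta = 2$ the first settles near a positive constant, which is the failure described in (f-3).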

3. Geometric representation of HDLSS data. Suppose $X \sim N_d(0, I_d)$.
When the dimension d is small, most of the mass of the data lies near the
origin. However, with a large d, Hall, Marron and Neeman [8] showed that
the Euclidean distance of X to the origin is described as

(3.1) $\|X\| = \sqrt{d} + o_p(\sqrt{d})$.

Moreover, the distance between two samples is also rather deterministic,
that is,

(3.2) $\|X_1 - X_2\| = \sqrt{2d} + o_p(\sqrt{d})$.

These results can be derived by the law of large numbers. Hall, Marron
and Neeman [8] generalized those results under the assumptions that
$d^{-1} \sum_{i=1}^{d} \mathrm{Var}(X_i) \to 1$ and $\{X_i\}$ is $\rho$-mixing.
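
A quick simulation makes (3.1) and (3.2) concrete in the standard Gaussian case. The sketch below is an illustration only; the function name `scaled_norms`, the seed and the dimension are arbitrary choices made for this example.

```python
import math
import random

def scaled_norms(d, seed=0):
    """Return (||X1|| / sqrt(d), ||X1 - X2|| / sqrt(2d)) for X1, X2 ~ N_d(0, I_d)."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    x2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm1 = math.sqrt(sum(a * a for a in x1))
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))
    return norm1 / math.sqrt(d), dist / math.sqrt(2 * d)

if __name__ == "__main__":
    for d in (10, 1000, 20000):
        print(d, scaled_norms(d))
```

Both scaled quantities should move toward 1 as d increases, which is the sense in which the lengths and pairwise distances become nearly deterministic.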

Application of part (1) of Theorem 1 generalizes these results. Let $X_{1,(d)}$,
$X_{2,(d)}$ be two samples that satisfy the assumptions of Theorem 1 part (1).
Assume without loss of generality that
$\lim_{d \to \infty} d^{-1} \sum_{i=1}^{d} \lambda_{i,d} = 1$. The scaled squared distance
between two data points is

$\frac{\|X_{1,(d)} - X_{2,(d)}\|^2}{\sum_{i=1}^{d} \lambda_{i,d}} = \sum_{i=1}^{d} \tilde{\lambda}_{i,d} z_{i1}^2 + \sum_{i=1}^{d} \tilde{\lambda}_{i,d} z_{i2}^2 - 2 \sum_{i=1}^{d} \tilde{\lambda}_{i,d} z_{i1} z_{i2}$,

where $\tilde{\lambda}_{i,d} = \lambda_{i,d} / \sum_{j=1}^{d} \lambda_{j,d}$. Note that by (1.1), the first
two terms are diagonal elements of $c_d^{-1} S_{D,d}$ in Theorem 1 and the third
term is (a multiple of) an off-diagonal element. Since $c_d^{-1} S_{D,d} \to I_n$, we
have (3.2). (3.1) is derived similarly.

Remark. If $\lim_{d \to \infty} d^{-1} \sum_{i=1}^{d} \lambda_{i,d} = 1$, then the conclusion (2.4)
of Theorem 1 part (1) holds if and only if the representations (3.1) and
(3.2) hold under the same assumptions in the theorem.
In this representation, the $\rho$-mixing assumption plays a very important
role. The following example, due to John Kent, shows that some type of
mixing condition is important.

Example 3.1 (Strong dependency via a scale mixture of Gaussian). Let
$X = Y_1 U + \sigma Y_2 (1 - U)$, where $Y_1$, $Y_2$ are two independent $N_d(0, I_d)$
random variables, $U = 0$ or 1 with probability 1/2 and independent of $Y_1$,
$Y_2$, and $\sigma > 1$. Then,

$\|X\| = \begin{cases} d^{1/2} + O_p(1), & \text{w.p. } 1/2, \\ \sigma d^{1/2} + O_p(1), & \text{w.p. } 1/2. \end{cases}$

Thus, (3.1) does not hold. Note that since
$\mathrm{Cov}(X) = \frac{1 + \sigma^2}{2} I_d$, the $\varepsilon$-condition holds and the variables
are uncorrelated. However, there is strong dependency, i.e.,
$\mathrm{Cov}(z_i^2, z_j^2) = (\frac{1 + \sigma^2}{2})^{-2} \mathrm{Cov}(x_i^2, x_j^2) = (\frac{1 - \sigma^2}{1 + \sigma^2})^2$
for all $i \ne j$, which implies that $\rho(m) > c$ for some $c > 0$, for all m.
Thus, the $\rho$-mixing condition does not hold for any permutation. Note,
however, that under a Gaussian assumption, given any covariance matrix
$\Sigma$, $Z = \Sigma^{-1/2} X$ has independent components.
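
The failure of (3.1) in Example 3.1 is easy to see in simulation: the scaled length $\|X\| / \sqrt{d}$ lands near 1 or near $\sigma$, never near a single common value. The sketch below follows the construction of the example; the function name, seed, dimension and the choice $\sigma = 3$ are illustrative assumptions.

```python
import math
import random

def mixture_norm_ratio(d, sigma, rng):
    """Draw X = Y1*U + sigma*Y2*(1-U) and return ||X|| / sqrt(d)."""
    u = 1 if rng.random() < 0.5 else 0          # U = 0 or 1 w.p. 1/2
    y1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    y2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    x = [a * u + sigma * b * (1 - u) for a, b in zip(y1, y2)]
    return math.sqrt(sum(v * v for v in x)) / math.sqrt(d)

if __name__ == "__main__":
    rng = random.Random(0)
    print([round(mixture_norm_ratio(5000, 3.0, rng), 2) for _ in range(10)])
```

Although every coordinate has the same variance and the coordinates are uncorrelated, the shared random scale keeps the lengths split into two groups, which is the dependence that the mixing condition rules out.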

Note that in the case where $X = (X_1, \ldots, X_d)$ is a sequence of i.i.d. random
variables, the results (3.1) and (3.2) can be considerably strengthened to
$\|X\| = \sqrt{d} + O_p(1)$ and $\|X_1 - X_2\| = \sqrt{2d} + O_p(1)$. The following
example shows that such strong results are beyond the reach of reasonable
assumptions.

Example 3.2 (Varying sphericity). Let $X \sim N_d(0, \Sigma_d)$, where
$\Sigma_d = \mathrm{diag}(d^{\alpha}, 1, \ldots, 1)$ and $\alpha \in (0, 1)$. Define
$Z = \Sigma_d^{-1/2} X$. Then the components of Z, the $z_i$'s, are independent
standard Gaussian random variables. We get
$\|X\|^2 = d^{\alpha} z_1^2 + \sum_{i=2}^{d} z_i^2$. Now for $0 < \alpha < 1/2$,
$d^{-1/2}(\|X\|^2 - d) \Rightarrow N(0, 2)$ (the variance is 2 because
$\mathrm{Var}(z_i^2) = 2$), and for $1/2 < \alpha < 1$,
$d^{-\alpha}(\|X\|^2 - d) \Rightarrow z_1^2$, where $\Rightarrow$ denotes convergence in
distribution. Thus, by the delta-method, we get

$\|X\| = \begin{cases} \sqrt{d} + O_p(1), & \text{if } 0 < \alpha < 1/2, \\ \sqrt{d} + O_p(d^{\alpha - 1/2}), & \text{if } 1/2 < \alpha < 1. \end{cases}$

In both cases, the representation (3.1) holds.

4. Consistency and strong inconsistency of PC directions. In this sec- 
tion, conditions for consistency or strong inconsistency of the sample PC 
direction vectors are investigated in the general setting of Section 1.1. The 
generic eigen-structure of the covariance matrix that we assume is the fol- 
lowing. For a fixed number k, we assume the first k eigenvalues are much 
larger than others. (The precise meaning of large will be addressed shortly.) 
The rest of the eigenvalues are assumed to satisfy the $\varepsilon$-condition, which is
very broad in the range of sphericity. We begin with the case $k = 1$ and
generalize the result for $k > 1$ in two distinct ways. The main theorem
(Theorem 2) contains and combines those previous results and also embraces
various cases according to the magnitude of the first k eigenvalues. We also
investigate the sufficient conditions for a stronger result, that is, almost
sure convergence, which involves use of the strong $\varepsilon$-condition.

Fig. 2. Projection of a d-dimensional random variable X onto $u_1$ and
$V_{d-1}$. If $\alpha > 1$, then the subspace $V_{d-1}$ becomes negligible compared to
$u_1$ when $d \to \infty$.

4.1. Criteria for consistency or strong inconsistency of the first PC
direction. Consider the simplest case that only the first PC direction of S is
of interest. Section 3 gives some preliminary indication of this. As an
illustration, consider a spiked model as in Example 3.2 but now let
$\alpha > 1$. Let $\{u_i\}$ be the set of eigenvectors of $\Sigma_d$ and $V_{d-1}$ be the
subspace of all eigenvectors except the first one. Then the projection of X
onto $u_1$ has a norm $\|\mathrm{Proj}_{u_1} X\| = \|X_1\| = O_p(d^{\alpha/2})$. The projection
of X onto $V_{d-1}$ has a norm $\sqrt{d} + o_p(\sqrt{d})$ by (3.1). Thus, when
$\alpha > 1$, if we scale the whole data space $\mathbb{R}^d$ by dividing by $d^{\alpha/2}$,
then $\mathrm{Proj}_{V_{d-1}} X$ becomes negligible compared to $\mathrm{Proj}_{u_1} X$ (see
Figure 2). Thus, for a large d, $\Sigma_d \approx \lambda_1 u_1 u_1'$ and the variation
of X is mostly along $u_1$. Therefore, the sample eigenvector corresponding to
the largest eigenvalue, $\hat{u}_1$, will be similar to $u_1$.

To generalize this, suppose the $\varepsilon_2$-condition holds. The following
proposition states that under the general setting in Section 1.1, the first
sample eigenvector $\hat{u}_1$ converges to its population counterpart $u_1$
(consistency) or tends to be perpendicular to $u_1$ (strong inconsistency)
according to the magnitude of the first eigenvalue $\lambda_1$, while all the other
sample eigenvectors are strongly inconsistent regardless of the magnitude of
$\lambda_1$.

Proposition 1. For a fixed n, let $\Sigma_d = U_d \Lambda_d U_d'$, $d = n + 1, n + 2, \ldots$,
be a sequence of covariance matrices. Let $X_{(d)}$ be a $d \times n$ data matrix from
a d-variate distribution with mean zero and covariance matrix $\Sigma_d$. Let
$S_d = \hat{U}_d \hat{\Lambda}_d \hat{U}_d'$ be the sample covariance matrix estimated from $X_{(d)}$
for each d. Assume the following:

(a) The components of $Z_{(d)} = \Lambda_d^{-1/2} U_d' X_{(d)}$ have uniformly bounded
fourth moments and are $\rho$-mixing for some permutation.

For an $\alpha_1 > 0$,

(b) $\lambda_{1,d} / d^{\alpha_1} \to c_1$ for some $c_1 > 0$.

(c) The $\varepsilon_2$-condition holds and $\sum_{i=2}^{d} \lambda_{i,d} = O(d)$.

If $\alpha_1 > 1$, then the first sample eigenvector is consistent and the others
are strongly inconsistent in the sense that

$\mathrm{Angle}(\hat{u}_1, u_1) \xrightarrow{p} 0$ as $d \to \infty$,

$\mathrm{Angle}(\hat{u}_i, u_i) \xrightarrow{p} \pi/2$ as $d \to \infty$, $\forall i = 2, \ldots, n$.

If $\alpha_1 \in (0, 1)$, then all sample eigenvectors are strongly inconsistent,
i.e.,

$\mathrm{Angle}(\hat{u}_i, u_i) \xrightarrow{p} \pi/2$ as $d \to \infty$, $\forall i = 1, \ldots, n$.

Note that the gap between consistency and strong inconsistency is very
thin, i.e., if we avoid $\alpha_1 = 1$, then we have either consistency or strong
inconsistency. Thus in the HDLSS context, asymptotic behavior of PC directions
is mostly captured by consistency and strong inconsistency. Now it makes
sense to say $\lambda_1$ is much larger than the others when $\alpha_1 > 1$, which results
in consistency. Also note that if $\alpha_1 < 1$, then the $\varepsilon$-condition holds,
which is in fact the condition for Theorem 1.
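
The dichotomy in Proposition 1 can be seen in a small simulation. The sketch below is an illustration under simplifying assumptions that are not part of the proposition: Gaussian data, $\Sigma_d = \mathrm{diag}(d^{\alpha}, 1, \ldots, 1)$ so that $u_1$ is the first coordinate axis, and a plain power iteration on the $n \times n$ Gram matrix (the dual approach of Section 1.1) in place of a full eigen decomposition. The function name `first_pc_angle` is hypothetical.

```python
import math
import random

def first_pc_angle(d, n, alpha, seed=0, iters=500):
    """Angle between the first sample PC direction and u_1 = e_1."""
    rng = random.Random(seed)
    # Square roots of the eigenvalues of Sigma_d = diag(d^alpha, 1, ..., 1).
    scale = [d ** (alpha / 2.0)] + [1.0] * (d - 1)
    # d x n data matrix X = Lambda^{1/2} Z (here U is the identity).
    X = [[scale[i] * rng.gauss(0.0, 1.0) for _ in range(n)] for i in range(d)]
    # n x n Gram matrix X'X, proportional to the dual sample covariance.
    G = [[sum(X[i][a] * X[i][b] for i in range(d)) for b in range(n)]
         for a in range(n)]
    # Power iteration for the leading eigenvector v of G.
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(G[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The leading eigenvector of XX' is proportional to X v.
    u_hat = [sum(X[i][a] * v[a] for a in range(n)) for i in range(d)]
    norm = math.sqrt(sum(x * x for x in u_hat))
    return math.acos(min(1.0, abs(u_hat[0]) / norm))

if __name__ == "__main__":
    for alpha in (0.3, 2.0):
        print(alpha, first_pc_angle(2000, 5, alpha))
```

With the spike exponent above 1 the angle should be close to 0, and with the exponent well below 1 it should be close to $\pi/2$; at finite d the approach to either limit can be slow when $\alpha$ is near 1.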

4.2. Generalizations. In this section, we generalize Proposition 1 to the 
case that multiple eigenvalues are much larger than the others. This leads 
to two different types of result. 

First is the case that the first p eigenvectors are each consistent. Consider
a covariance structure with multiple spikes, that is, p eigenvalues, $p > 1$,
which are much larger than the others. In order to have consistency of the
first p eigenvectors, we require that each of the p eigenvalues has a distinct
order of magnitude, for example, $\lambda_{1,d} = d^3$, $\lambda_{2,d} = d^2$ and the sum of
the rest is of order d.

Proposition 2. For a fixed n, let $\Sigma_d$, $X_{(d)}$, and $S_d$ be as before. Assume
(a) of Proposition 1. Let $\alpha_1 > \alpha_2 > \cdots > \alpha_p > 1$ for some $p < n$.
Suppose the following conditions hold:

(b) $\lambda_{i,d} / d^{\alpha_i} \to c_i$ for some $c_i > 0$, $\forall i = 1, \ldots, p$.

(c) The $\varepsilon_{p+1}$-condition holds and $\sum_{i=p+1}^{d} \lambda_{i,d} = O(d)$.

Then, the first p sample eigenvectors are consistent and the others are
strongly inconsistent in the sense that

$\mathrm{Angle}(\hat{u}_i, u_i) \xrightarrow{p} 0$ as $d \to \infty$, $\forall i = 1, \ldots, p$,

$\mathrm{Angle}(\hat{u}_i, u_i) \xrightarrow{p} \pi/2$ as $d \to \infty$, $\forall i = p + 1, \ldots, n$.

Consider now a distribution having a covariance structure with multiple
spikes as before. Let k be the number of spikes. An interesting phenomenon
happens when the first k eigenvalues are of the same order of magnitude,
that is, $\lim_{d \to \infty} \lambda_{1,d} / \lambda_{k,d} = c > 1$ for some constant c. Then
the first k sample eigenvectors are neither consistent nor strongly
inconsistent. However, all of those random directions converge to the
subspace spanned by the first k population eigenvectors. Essentially, when
eigenvalues are of the same order, the eigen-directions cannot be separated
but are subspace-consistent with the proper subspace.

Proposition 3. For a fixed n, let $\Sigma_d$, $X_{(d)}$, and $S_d$ be as before. Assume
(a) of Proposition 1. Let $\alpha_1 > 1$ and $k < n$. Suppose the following
conditions hold:

(b) $\lambda_{i,d} / d^{\alpha_1} \to c_i$ for some $c_i > 0$, $\forall i = 1, \ldots, k$.

(c) The $\varepsilon_{k+1}$-condition holds and $\sum_{i=k+1}^{d} \lambda_{i,d} = O(d)$.

Then the first k sample eigenvectors are subspace-consistent with the
subspace spanned by the first k population eigenvectors, and the others are
strongly inconsistent in the sense that

$\mathrm{Angle}(\hat{u}_i, \mathrm{span}\{u_1, \ldots, u_k\}) \xrightarrow{p} 0$ as $d \to \infty$, $\forall i = 1, \ldots, k$,

$\mathrm{Angle}(\hat{u}_i, u_i) \xrightarrow{p} \pi/2$ as $d \to \infty$, $\forall i = k + 1, \ldots, n$.
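
Subspace consistency can also be illustrated numerically for $k = 2$. The sketch below is a simulation under assumptions chosen for this example only: Gaussian data, a diagonal covariance with two leading eigenvalues $c_1 d^{\alpha}$ and $c_2 d^{\alpha}$ of the same order, unit remaining eigenvalues, and power iteration on the Gram matrix to obtain the first sample direction. It returns the angle of that direction to $u_1$ alone and to $\mathrm{span}\{u_1, u_2\}$; the function name is hypothetical.

```python
import math
import random

def first_pc_vs_span(d, n, c1=2.0, c2=1.0, alpha=2.0, seed=0, iters=500):
    """Angles of the first sample PC to u_1 and to span{u_1, u_2}."""
    rng = random.Random(seed)
    # Two spikes of the same order d^alpha; remaining eigenvalues equal 1.
    scale = ([math.sqrt(c1 * d ** alpha), math.sqrt(c2 * d ** alpha)]
             + [1.0] * (d - 2))
    X = [[scale[i] * rng.gauss(0.0, 1.0) for _ in range(n)] for i in range(d)]
    G = [[sum(X[i][a] * X[i][b] for i in range(d)) for b in range(n)]
         for a in range(n)]
    v = [1.0] * n
    for _ in range(iters):                      # power iteration on X'X
        w = [sum(G[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u_hat = [sum(X[i][a] * v[a] for a in range(n)) for i in range(d)]
    norm = math.sqrt(sum(x * x for x in u_hat))
    cos_u1 = min(1.0, abs(u_hat[0]) / norm)
    cos_span = min(1.0, math.sqrt(u_hat[0] ** 2 + u_hat[1] ** 2) / norm)
    return math.acos(cos_u1), math.acos(cos_span)

if __name__ == "__main__":
    print(first_pc_vs_span(2000, 5))
```

The angle to the two-dimensional span should be small, while the angle to $u_1$ itself depends on the particular sample and need not be small for fixed n.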

4.3. Main theorem. Propositions 1-3 are combined and generalized in
the main theorem. Consider p groups of eigenvalues, which grow at the
same rate within each group as in Proposition 3. Each group has a finite
number of eigenvalues and the number of eigenvalues in all groups, $\kappa$, does
not exceed n. Also similar to Proposition 2, let the orders of magnitude of
the p groups be different from each other. We require that the
$\varepsilon_{\kappa+1}$-condition holds. The following theorem states that a sample
eigenvector of a group converges to the subspace of population eigenvectors
of the group.

Theorem 2 (Main theorem). For a fixed n, let $\Sigma_d$, $X_{(d)}$, and $S_d$ be as
before. Assume (a) of Proposition 1. Let $\alpha_1, \ldots, \alpha_p$ be such that
$\alpha_1 > \alpha_2 > \cdots > \alpha_p > 1$ for some $p < n$. Let $k_1, \ldots, k_p$ be
nonnegative integers such that $\sum_{j=1}^{p} k_j = \kappa < n$. Let $k_0 = 0$ and
$k_{p+1} = d - \kappa$. Let $J_1, \ldots, J_{p+1}$ be sets of indices such that

$J_l = \{1 + \sum_{j=0}^{l-1} k_j, 2 + \sum_{j=0}^{l-1} k_j, \ldots, k_l + \sum_{j=0}^{l-1} k_j\}$, $l = 1, \ldots, p + 1$.

Suppose the following conditions hold:

(b) $\lambda_{i,d} / d^{\alpha_l} \to c_i$ for some $c_i > 0$, $\forall i \in J_l$, $\forall l = 1, \ldots, p$.

(c) The $\varepsilon_{\kappa+1}$-condition holds and $\sum_{i \in J_{p+1}} \lambda_{i,d} = O(d)$.

Then the sample eigenvectors whose label is in the group $J_l$, for
$l = 1, \ldots, p$, are subspace-consistent with the space spanned by the
population eigenvectors whose labels are in $J_l$ and the others are strongly
inconsistent in the sense that

(4.1) $\mathrm{Angle}(\hat{u}_i, \mathrm{span}\{u_j : j \in J_l\}) \xrightarrow{p} 0$ as $d \to \infty$, $\forall i \in J_l$, $\forall l = 1, \ldots, p$,

and

(4.2) $\mathrm{Angle}(\hat{u}_i, u_i) \xrightarrow{p} \pi/2$ as $d \to \infty$, $\forall i = \kappa + 1, \ldots, n$.

Remark. If the cardinality of $J_l$, $k_l$, is 1, then (4.1) implies $\hat{u}_i$ is
consistent for $i \in J_l$.

Remark. The strongly inconsistent eigenvectors whose labels are in
$J_{p+1}$ can be considered to be subspace-consistent. Let $\Gamma_d$ be the subspace
spanned by the population eigenvectors whose labels are in $J_{p+1}$ for each d,
i.e., $\Gamma_d = \mathrm{span}\{u_j : j \in J_{p+1}\} = \mathrm{span}\{u_{\kappa+1}, \ldots, u_d\}$. Then

$\mathrm{Angle}(\hat{u}_{i,d}, \Gamma_d) \xrightarrow{p} 0$ as $d \to \infty$

for all $i \in J_{p+1}$.

Note that the formulation of the theorem is similar to the spiked covariance
model but much more general. The uniform assumption on the underlying
eigenvalues, that is, $\lambda_i = 1$ for all $i > \kappa$, is relaxed to the
$\varepsilon$-condition. We also have catalogued a large collection of specific
results according to the various sizes of spikes.

These results are now illustrated for some classes of covariance matrices
that are of special interest. These covariance matrices are easily represented
in factor form, that is, in terms of $F_d = \Sigma_d^{1/2}$.
Example 4.1. Consider a series of covariance matrices $\{\Sigma_d\}_d$. Let
$\Sigma_d = F_d F_d'$, where $F_d$ is a $d \times d$ symmetric matrix such that

$F_d = (1 - \rho_d) I_d + \rho_d J_d = \begin{pmatrix} 1 & \rho_d & \cdots & \rho_d \\ \rho_d & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \rho_d \\ \rho_d & \cdots & \rho_d & 1 \end{pmatrix}$,

where $J_d$ is the $d \times d$ matrix of ones and $\rho_d \in (0, 1)$ depends on d. The
eigenvalues of $\Sigma_d$ are $\lambda_{1,d} = (d\rho_d + 1 - \rho_d)^2$,
$\lambda_{2,d} = \cdots = \lambda_{d,d} = (1 - \rho_d)^2$.
Note that this is a simple and natural probabilistic mechanism that generates
eigenvalues where the first is an order of magnitude larger than the rest (our
fundamental assumption). The first eigenvector is
$u_1 = \frac{1}{\sqrt{d}}(1, 1, \ldots, 1)'$, while $\{u_2, \ldots, u_d\}$ is any orthogonal
set of direction vectors perpendicular to $u_1$. Note that
$\sum_{i=2}^{d} \lambda_{i,d} = (d - 1)(1 - \rho_d)^2 = O(d)$ and the $\varepsilon_2$-condition
holds. Let $X_d \sim N_d(0, \Sigma_d)$. By Theorem 2, if $\rho_d \in (0, 1)$ is a fixed
constant or decreases to 0 slowly so that $\rho_d \gg d^{-1/2}$, then the first PC
direction $\hat{u}_1$ is consistent. Else if $\rho_d$ decreases to 0 so quickly that
$\rho_d \ll d^{-1/2}$, then $\hat{u}_1$ is strongly inconsistent. In both cases, all
the other sample PC directions are strongly inconsistent.
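
The eigen-structure stated in Example 4.1 can be verified directly for a small d. The sketch below builds $F_d$ and $\Sigma_d = F_d F_d'$ and checks that the normalized vector of ones and one contrast vector satisfy the eigenvalue equation with the stated eigenvalues. The function name and the particular contrast vector $(1, -1, 0, \ldots, 0)'$ are choices made for this illustration.

```python
import math

def compound_symmetry_check(d, rho):
    """Return (lam1, lam2, err1, err2) for Sigma = F F', F = (1-rho) I + rho J."""
    F = [[1.0 if i == j else rho for j in range(d)] for i in range(d)]
    Sigma = [[sum(F[i][k] * F[j][k] for k in range(d)) for j in range(d)]
             for i in range(d)]

    def apply(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    ones = [1.0 / math.sqrt(d)] * d             # claimed first eigenvector
    contrast = [1.0, -1.0] + [0.0] * (d - 2)    # one vector orthogonal to it
    lam1 = (d * rho + 1.0 - rho) ** 2
    lam2 = (1.0 - rho) ** 2
    # Largest coordinate-wise deviation from Sigma v = lam v.
    err1 = max(abs(a - lam1 * b) for a, b in zip(apply(Sigma, ones), ones))
    err2 = max(abs(a - lam2 * b)
               for a, b in zip(apply(Sigma, contrast), contrast))
    return lam1, lam2, err1, err2

if __name__ == "__main__":
    print(compound_symmetry_check(6, 0.3))
```

Both residuals should be at the level of floating-point rounding, confirming the two eigenvalue formulas for the chosen d and $\rho_d$.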

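Example 4.2. Consider now a $2d \times 2d$ covariance matrix $\Sigma_d = F_d F_d'$,
where $F_d$ is a block diagonal matrix, such that

$F_d = \begin{pmatrix} F_{1,d} & O \\ O & F_{2,d} \end{pmatrix}$,

where $F_{1,d} = (1 - \rho_{1,d}) I_d + \rho_{1,d} J_d$ and
$F_{2,d} = (1 - \rho_{2,d}) I_d + \rho_{2,d} J_d$. Suppose
$0 < \rho_{2,d} \le \rho_{1,d} < 1$. Note that
$\lambda_{1,d} = (d\rho_{1,d} + 1 - \rho_{1,d})^2$,
$\lambda_{2,d} = (d\rho_{2,d} + 1 - \rho_{2,d})^2$
and the $\varepsilon_3$-condition holds. Let $X_{2d} \sim N_{2d}(0, \Sigma_d)$. Application
of Theorem 2 for various conditions on $\rho_{1,d}$, $\rho_{2,d}$ is summarized as
follows. Denote, for two nonincreasing sequences $\mu_d, \nu_d \in (0, 1)$,
$\mu_d \gg \nu_d$ for $\nu_d = o(\mu_d)$ and $\mu_d \succeq \nu_d$ for
$\lim_{d \to \infty} \mu_d / \nu_d = c \in [1, \infty)$:

1. $\rho_{1,d} \gg \rho_{2,d} \gg d^{-1/2}$: Both $\hat{u}_1$, $\hat{u}_2$ consistent.

2. $\rho_{1,d} \succeq \rho_{2,d} \gg d^{-1/2}$: Both $\hat{u}_1$, $\hat{u}_2$ subspace-consistent to $\mathrm{span}\{u_1, u_2\}$.

3. $\rho_{1,d} \gg d^{-1/2} \gg \rho_{2,d}$: $\hat{u}_1$ consistent, $\hat{u}_2$ strongly inconsistent.

4. $d^{-1/2} \gg \rho_{1,d} \gg \rho_{2,d}$: Both $\hat{u}_1$, $\hat{u}_2$ strongly inconsistent.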

4.4. Corollaries to the main theorem. The result can be extended for 
special cases. 

First of all, consider constructing $X_{(d)}$ from $Z_d$ by $X_{(d)} = U_d\Lambda_d^{1/2}Z_d$, where $Z_d$ is a truncated set from an infinite sequence of independent random variables with mean zero and variance 1. This assumption makes it possible to have convergence in the almost sure sense. This is mainly because the triangular array $\{Z_{1i,(d)}\}_{i,d}$ becomes the single sequence $\{Z_{1i}\}_i$.

Corollary 1. Suppose all the assumptions in Theorem 2 hold, with assumption (a) replaced by the following:

(a') The components of $Z_{(d)} = \Lambda_d^{-1/2}U_d'X_{(d)}$ have uniformly bounded eighth moments and are independent of each other. Let $Z_{1i,(d)} \equiv Z_{1i}$ for all $i, d$.

If the strong $\varepsilon_{\kappa+1}$-condition (2.3) holds, then the mode of convergence of (4.1) and (4.2) is almost sure.

Second, consider the case that both $d$ and $n$ tend to infinity. Under the setting of Theorem 2, we can separate PC directions better when the eigenvalues are distinct. When $d \to \infty$, we have subspace consistency of $\hat u_i$ with the proper subspace, which includes $u_i$. Now letting $n \to \infty$ makes it possible for $\hat u_i$ to be consistent.

Corollary 2. Let $\Sigma_d$, $X_{(d)}$ and $S_d$ be as before. Under the assumptions (a), (b) and (c) in Theorem 2, assume further for (b) that the first $\kappa$ eigenvalues are distinct, that is, $c_i > c_j$ for $i > j$ and $i, j \in J_l$ for $l = 1,\ldots,p$. Then for all $i \le \kappa$,

$$\operatorname{Angle}(\hat u_i, u_i) \xrightarrow{p} 0 \quad \text{as } d \to \infty,\ n \to \infty, \tag{4.3}$$

where the limits are applied successively.

If assumption (a) is replaced by assumption (a') of Corollary 1, then the mode of convergence of (4.3) is almost sure.

This corollary can be viewed as the case when $d$ and $n$ tend to infinity together, but $d$ increases at a much faster rate than $n$, that is, $d \gg n$. When $n$ also increases in the particular setting of the corollary, the sample eigenvectors, which were only subspace-consistent in the $d \to \infty$ case, tend to be distinguishable and each of the eigenvectors is consistent. We conjecture that the inconsistent sample eigenvalues are still strongly inconsistent when $d, n \to \infty$ and $d \gg n$.

4.5. Limiting distributions of corresponding eigenvalues. The study of 
asymptotic behavior of the sample eigenvalues is an important part in the 
proof of Theorem 2, and also could be of independent interest. The following 
lemma states that the large sample eigenvalues increase at the same speed as 
their population counterpart and the relatively small eigenvalues tend to be 
of order of $d$ as $d$ tends to infinity. Let $\varphi_i(A)$ denote the $i$th largest eigenvalue of the symmetric matrix $A$ and $\varphi_{i,l}(A) = \varphi_{i^*}(A)$, where $i^* = i - \sum_{j=1}^{l-1}k_j$.

Lemma 1. If the assumptions of Theorem 2 hold, and $Z_l$ is a $k_l \times n$ matrix from blocks of $Z$ as defined in (5.2), then

$$\hat\lambda_i/d^{\alpha_l} \Longrightarrow \eta_i \quad \text{as } d \to \infty \quad \text{if } i \in J_l,\ \forall l = 1,\ldots,p,$$

$$\hat\lambda_i/d \xrightarrow{p} K \quad \text{as } d \to \infty \quad \text{if } i = \kappa+1,\ldots,n,$$

where each $\eta_i$ is a random variable whose support is $(0,\infty)$ almost surely and indeed $\eta_i = \varphi_{i,l}(n^{-1}C_l^{1/2}Z_lZ_l'C_l^{1/2})$ for each $i \in J_l$, where $C_l = \operatorname{diag}\{c_j : j \in J_l\}$ and $K = \lim_{d\to\infty}(dn)^{-1}\sum_{i\in J_{p+1}}\lambda_{i,d}$.
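The second statement of Lemma 1 can be illustrated with a single simulated draw. The sketch assumes a hypothetical single-spike model that is not taken from the paper: $U = I_d$ and $\Lambda = \operatorname{diag}(d^{\alpha}, 1, \ldots, 1)$ with $\alpha = 1.5$, so that $K = \lim (dn)^{-1}(d-1) = 1/n$. The sample eigenvalues are computed from the $n \times n$ dual matrix $n^{-1}X'X$, which shares its nonzero eigenvalues with $S$.

```python
import numpy as np

def dual_eigenvalues(d, n, alpha, seed=0):
    """Eigenvalues (descending) of n^{-1} X'X for an assumed single-spike model.

    Model (illustrative only): U = I_d, Lambda = diag(d**alpha, 1, ..., 1),
    X = Lambda^{1/2} Z with i.i.d. standard normal entries in Z.
    """
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((d, n))
    scale = np.ones(d)
    scale[0] = d ** (alpha / 2.0)      # square root of the spiked eigenvalue
    X = scale[:, None] * Z
    S_D = X.T @ X / n                  # n x n dual sample covariance
    return np.linalg.eigvalsh(S_D)[::-1]

d, n, alpha = 20000, 5, 1.5
lam_hat = dual_eigenvalues(d, n, alpha)

spike_ratio = lam_hat[0] / d ** alpha  # one draw of the random limit eta_1
tail_ratios = lam_hat[1:] / d          # should be close to K = 1/n

print(spike_ratio, tail_ratios, 1.0 / n)
```

The spike ratio stays random as $d$ grows, while the remaining scaled eigenvalues concentrate around $K$.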

If the data matrix $X_{(d)}$ is Gaussian, then the first $\kappa$ sample eigenvalues converge in distribution to quantities that have known distributions.

Corollary 3. Under all the assumptions of Theorem 2, assume further that $X_{(d)} \sim N_d(0,\Sigma_d)$ for each $d$. Then, for $i \in J_l$, $l = 1,\ldots,p$,

$$\frac{\hat\lambda_i}{d^{\alpha_l}} \Longrightarrow \varphi_{i,l}\bigl(n^{-1}W_{k_l}(n,C_l)\bigr) \quad \text{as } d \to \infty,$$

where $W_{k_l}(n,C_l)$ denotes a $k_l \times k_l$ random matrix distributed as the Wishart distribution with $n$ degrees of freedom and covariance $C_l$.
If $k_l = 1$ for some $l$, then for $i \in J_l$,

$$\frac{\hat\lambda_i}{\lambda_i} \Longrightarrow \frac{\chi_n^2}{n} \quad \text{as } d \to \infty,$$

where $\chi_n^2$ denotes a random variable distributed as the $\chi^2$ distribution with $n$ degrees of freedom.

This generalizes the results in Section 4.2 of Ahn et al. [1].
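The $\chi_n^2/n$ limit for a single large eigenvalue can be probed by Monte Carlo. The sketch below assumes a hypothetical Gaussian model with $U = I_d$, $\lambda_1 = d^2$ and all other eigenvalues equal to 1; the function name and the values `d = 400`, `n = 5`, `reps = 2000` are illustrative. If the limit holds, the ratio $\hat\lambda_1/\lambda_1$ should have mean near 1 and variance near $2/n$.

```python
import numpy as np

def top_eigenvalue_ratios(d, n, reps, seed=0):
    """Monte Carlo draws of lambda_hat_1 / lambda_1 under an assumed spike model."""
    rng = np.random.default_rng(seed)
    lam_1 = float(d) ** 2                   # assumed spike: alpha = 2, c_1 = 1
    Z = rng.standard_normal((reps, d, n))
    Z[:, 0, :] *= np.sqrt(lam_1)            # X = Lambda^{1/2} Z with U = I_d
    S_D = np.swapaxes(Z, 1, 2) @ Z / n      # batch of n x n dual matrices
    lam_hat_1 = np.linalg.eigvalsh(S_D)[:, -1]
    return lam_hat_1 / lam_1

n = 5
ratios = top_eigenvalue_ratios(d=400, n=n, reps=2000)

# chi^2_n / n has mean 1 and variance 2 / n.
print(ratios.mean(), ratios.var(), 2.0 / n)
```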

5. Proofs. 

5.1. Proof of Theorem 1. First, we give the proof of part (1). By (1.1), the $m$th diagonal entry of $nS_D$ can be expressed as $\sum_{i=1}^d \lambda_{i,d}z_{im,d}^2$, where $z_{im,d}$ is the $(i,m)$th entry of the matrix $Z_{(d)}$. Define the relative eigenvalues $\tilde\lambda_{i,d}$ as $\tilde\lambda_{i,d} \equiv \lambda_{i,d}/\sum_{i=1}^d\lambda_{i,d}$. Let $\pi_d$ denote the given permutation for each $d$ and let $Y_i = z^2_{\pi_d(i)m,d} - 1$. Then the $Y_i$'s are $\rho$-mixing, $E(Y_i) = 0$ and $E(Y_i^2) \le B$ for all $i$ for some $B < \infty$. Let $\rho(m) = \sup|\operatorname{corr}(Y_i, Y_{i+m})|$, where the sup is over all $i$. We shall use the following lemma.

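Lemma 2. For any permutation $\pi_d^*$,

$$\lim_{d\to\infty}\sum_{i=1}^d \tilde\lambda_{\pi_d^*(i),d}\,\rho(i) = 0.$$

Proof. For any $\delta > 0$, since $\lim_{i\to\infty}\rho(i) = 0$, we can choose $N$ such that $\rho(i) < \delta/2$ for all $i > N$. Since $\lim_{d\to\infty}\sum_{i=1}^d \tilde\lambda^2_{\pi_d^*(i),d} = 0$, we get $\lim_{d\to\infty}\sum_{i=1}^N \tilde\lambda_{\pi_d^*(i),d} = 0$. Thus, we can choose $d_0$ satisfying $\sum_{i=1}^N \tilde\lambda_{\pi_d^*(i),d} < \delta/2$ for all $d > d_0$. With the fact that $\sum_{i=1}^d \tilde\lambda_{i,d} = 1$ for all $d$ and $\rho(i) < 1$, we get for all $d > d_0$,

$$\sum_{i=1}^d \tilde\lambda_{\pi_d^*(i),d}\,\rho(i) = \sum_{i=1}^N \tilde\lambda_{\pi_d^*(i),d}\,\rho(i) + \sum_{i=N+1}^d \tilde\lambda_{\pi_d^*(i),d}\,\rho(i) < \delta. \qquad \Box$$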

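Now let $\pi_d^{-1}$ be the inverse permutation of $\pi_d$. Then by Lemma 2 and the $\varepsilon$-condition, there exists a permutation $\pi_d^*$ such that

$$E\left(\sum_{i=1}^d \tilde\lambda_{\pi_d^{-1}(i),d}Y_i\right)^2 = \sum_{i=1}^d \tilde\lambda^2_{\pi_d^{-1}(i),d}EY_i^2 + 2\sum_{i=1}^d \tilde\lambda_{\pi_d^{-1}(i),d}\sum_{j=i+1}^d \tilde\lambda_{\pi_d^{-1}(j),d}EY_iY_j \le \sum_{i=1}^d \tilde\lambda^2_{i,d}B + 2\sum_{i=1}^d \tilde\lambda_{i,d}\sum_{j=1}^d \tilde\lambda_{\pi_d^*(j),d}\,\rho(j)B^2 \to 0$$

as $d \to \infty$. Then Chebyshev's inequality gives us, for any $\tau > 0$,

$$P\left[\left|\sum_{i=1}^d \tilde\lambda_{i,d}z_{im}^2 - 1\right| > \tau\right] \le \frac{E\left(\sum_{i=1}^d \tilde\lambda_{\pi_d^{-1}(i),d}Y_i\right)^2}{\tau^2} \to 0$$

as $d \to \infty$. Thus, we conclude that the diagonal elements of $nS_D$ converge to 1 in probability.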

The off-diagonal elements of $nS_D$ can be expressed as $\sum_{i=1}^d \tilde\lambda_{i,d}z_{im}z_{il}$. Similar arguments to those used in the diagonal case, together with the fact that $z_{im}$ and $z_{il}$ are independent, give that

$$E\left(\sum_{i=1}^d \tilde\lambda_{i,d}z_{im}z_{il}\right)^2 \le \sum_{i=1}^d \tilde\lambda_{i,d}^2 + 2\sum_{i=1}^d \tilde\lambda_{i,d}\sum_{j=i+1}^d \tilde\lambda_{j,d}\,\rho^2(j-i) \to 0$$

as $d \to \infty$. Thus, by Chebyshev's inequality, the off-diagonal elements of $nS_D$ converge to 0 in probability.
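The conclusion of part (1), that the scaled dual matrix approaches the identity, can be illustrated numerically. The sketch below makes several assumptions that are not in the paper: independent standard normal entries in $Z$ (a special case of the mixing condition), eigenvalues drawn uniformly from $[0.5, 1.5]$, and scaling by $\sum_i \lambda_{i,d}$ so that the relative eigenvalues $\tilde\lambda_{i,d}$ appear. The sphericity value `eps` is computed from the identity $\sum_i \tilde\lambda_{i,d}^2 = (d\varepsilon)^{-1}$ used in the proof.

```python
import numpy as np

def scaled_dual(d, n, seed=0):
    """Return X'X / sum(lambda) and the sphericity measure epsilon.

    Assumed model (illustrative only): U = I_d, lambda_i ~ Uniform[0.5, 1.5],
    Z with i.i.d. standard normal entries.
    """
    rng = np.random.default_rng(seed)
    lam = rng.uniform(0.5, 1.5, size=d)
    Z = rng.standard_normal((d, n))
    X = np.sqrt(lam)[:, None] * Z
    G = X.T @ X / lam.sum()            # entries: sum_i lam_tilde_i z_im z_il
    eps = lam.sum() ** 2 / (d * np.sum(lam ** 2))
    return G, eps

d, n = 20000, 5
G, eps = scaled_dual(d, n)
max_dev = np.max(np.abs(G - np.eye(n)))
print(eps, max_dev)
```

Diagonal entries near 1 and off-diagonal entries near 0 correspond to the geometric representation: the $n$ data vectors have nearly equal lengths and are nearly orthogonal.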

Now, we give the proof for part (2). We begin with the $m$th diagonal entry of $nS_D$, $\sum_{i=1}^d \tilde\lambda_{i,d}z_{im}^2$. Note that since $\sum_{i=1}^{k-1}\tilde\lambda_{i,d} \to 0$ by the $\varepsilon_k$-condition, we assume $k = 1$ in (2.3) without loss of generality.

Let $Y_i = z_{im}^2 - 1$. Note that the $Y_i$'s are independent, $E(Y_i) = 0$ and $E(Y_i^4) \le B$ for all $i$ for some $B < \infty$. Now



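$$E\left(\sum_{i=1}^d \tilde\lambda_{i,d}Y_i\right)^4 = E\sum_{i,j,k,l=1}^d \tilde\lambda_{i,d}\tilde\lambda_{j,d}\tilde\lambda_{k,d}\tilde\lambda_{l,d}Y_iY_jY_kY_l. \tag{5.1}$$

Note that terms in the sum of the form $EY_iY_jY_kY_l$, $EY_i^2Y_jY_k$ and $EY_i^3Y_j$ are 0 if $i, j, k, l$ are distinct. The only terms that do not vanish are those of the form $EY_i^4$ and $EY_i^2Y_j^2$, both of which are bounded by $B$. Since the $\tilde\lambda_{i,d}^2$'s are nonnegative, the sum of squares is at most the square of the sum, so we have $\sum_{i=1}^d \tilde\lambda_{i,d}^4 \le \left(\sum_{i=1}^d \tilde\lambda_{i,d}^2\right)^2$. Also note that by the strong $\varepsilon$-condition, $\sum_{i=1}^d \tilde\lambda_{i,d}^2 = (d\varepsilon)^{-1} = o(d^{-1/2})$. Thus, (5.1) is bounded as

$$E\left(\sum_{i=1}^d \tilde\lambda_{i,d}Y_i\right)^4 \le \sum_{i=1}^d \tilde\lambda_{i,d}^4B + \sum_{i=j\ne k=l}\tilde\lambda_{i,d}^2\tilde\lambda_{k,d}^2B \le \left(\sum_{i=1}^d \tilde\lambda_{i,d}^2\right)^2B + \binom{4}{2}\left(\sum_{i=1}^d \tilde\lambda_{i,d}^2\right)^2B = o(d^{-1}).$$

Then Chebyshev's inequality gives us, for any $\tau > 0$,

$$P\left[\left|\sum_{i=1}^d \tilde\lambda_{i,d}z_{im}^2 - 1\right| > \tau\right] \le \frac{E\left(\sum_{i=1}^d \tilde\lambda_{i,d}Y_i\right)^4}{\tau^4} \le \frac{o(d^{-1})}{\tau^4}.$$

Summing over $d$ gives $\sum_{d=1}^\infty P\left[\left|\sum_{i=1}^d \tilde\lambda_{i,d}z_{im}^2 - 1\right| > \tau\right] < \infty$ and by the Borel-Cantelli lemma, we conclude that a diagonal element $\sum_{i=1}^d \tilde\lambda_{i,d}z_{im}^2$ converges to 1 almost surely.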

The off-diagonal elements of $nS_D$ can be expressed as $\sum_{i=1}^d \tilde\lambda_{i,d}z_{im}z_{il}$. Using similar arguments to those used in the diagonal case, we have

$$\sum_{d=1}^\infty P\left[\left|\sum_{i=1}^d \tilde\lambda_{i,d}z_{im}z_{il}\right| > \tau\right] < \infty$$

and again by the Borel-Cantelli lemma, the off-diagonal elements converge to 0 almost surely.



5.2. Proofs of Lemma 1 and Theorem 2. The proof of Theorem 2 is divided into two parts. Since eigenvectors are associated with eigenvalues, we first focus on the asymptotic behavior of the sample eigenvalues (Section 5.2.1) and then investigate consistency or strong inconsistency of the sample eigenvectors (Section 5.2.2).



5.2.1. Proof of Lemma 1. The proof relies heavily on the following lemma. Recall that $\varphi_k(A)$ denotes the $k$th largest eigenvalue of $A$.




Lemma 3 (Weyl's inequality). If $A$, $B$ are $m \times m$ real symmetric matrices, then for all $k = 1,\ldots,m$,

$$\left.\begin{array}{c} \varphi_k(A) + \varphi_m(B) \\ \varphi_{k+1}(A) + \varphi_{m-1}(B) \\ \vdots \\ \varphi_m(A) + \varphi_k(B) \end{array}\right\} \le \varphi_k(A+B) \le \left\{\begin{array}{c} \varphi_k(A) + \varphi_1(B), \\ \varphi_{k-1}(A) + \varphi_2(B), \\ \vdots \\ \varphi_1(A) + \varphi_k(B). \end{array}\right.$$

This inequality is discussed in Rao [17] and its use in asymptotic studies of eigenvalues of a random matrix appeared in Eaton and Tyler [6].
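Weyl's inequality is easy to check numerically on random symmetric matrices. The following sketch evaluates every upper and lower bound listed in Lemma 3; the helper names `phi` and `weyl_bounds_hold` and the tolerance are illustrative choices.

```python
import numpy as np

def phi(A):
    """Eigenvalues of symmetric A in decreasing order; phi(A)[k-1] is the k-th largest."""
    return np.linalg.eigvalsh(A)[::-1]

def weyl_bounds_hold(A, B, tol=1e-10):
    """Check all bounds of Weyl's inequality for phi_k(A + B), k = 1..m."""
    m = A.shape[0]
    a, b, s = phi(A), phi(B), phi(A + B)
    for k in range(1, m + 1):
        # Upper bounds: phi_{k-j}(A) + phi_{1+j}(B), j = 0, ..., k-1.
        uppers = [a[k - 1 - j] + b[j] for j in range(k)]
        # Lower bounds: phi_{k+j}(A) + phi_{m-j}(B), j = 0, ..., m-k.
        lowers = [a[k - 1 + j] + b[m - 1 - j] for j in range(m - k + 1)]
        if s[k - 1] > min(uppers) + tol or s[k - 1] < max(lowers) - tol:
            return False
    return True

def random_symmetric(m, rng):
    M = rng.standard_normal((m, m))
    return (M + M.T) / 2.0

rng = np.random.default_rng(0)
results = [weyl_bounds_hold(random_symmetric(6, rng), random_symmetric(6, rng))
           for _ in range(20)]
print(all(results))
```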

Since $S$ and its dual $S_D$ share nonzero eigenvalues, one of the main ideas of the proof is working with $S_D$. By our decomposition (1.1), $nS_D = Z'\Lambda Z$. We also write $Z$ and $\Lambda$ as block matrices such that

$$Z = \begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_{p+1} \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} \Lambda_1 & O & \cdots & O \\ O & \Lambda_2 & \cdots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \cdots & \Lambda_{p+1} \end{pmatrix}, \tag{5.2}$$

where $Z_l$ is a $k_l \times n$ matrix for each $l = 1,\ldots,p+1$, $\Lambda_l\ (\equiv \Lambda_{l,d})$ is a $k_l \times k_l$ diagonal matrix for each $l = 1,\ldots,p+1$, and $O$ denotes a matrix where all elements are zeros. Now, we can write

$$nS_D = Z'\Lambda Z = \sum_{l=1}^{p+1} Z_l'\Lambda_lZ_l. \tag{5.3}$$

While $Z_l$ depends on $d = 1,\ldots,\infty$, this dependence is not explicitly shown (e.g., by subscript) for simplicity of notation.
Note that Theorem 1 implies that when the last term in equation (5.3) is divided by $d$, it converges to a multiple of the identity matrix, namely,

$$d^{-1}Z_{p+1}'\Lambda_{p+1}Z_{p+1} \xrightarrow{p} nK \cdot I_n, \tag{5.4}$$

where $K \in (0,\infty)$ is such that $(dn)^{-1}\sum_{i\in J_{p+1}}\lambda_{i,d} \to K$. Moreover, dividing by $d^{\alpha_1}$ gives us

$$nd^{-\alpha_1}S_D = d^{-\alpha_1}Z_1'\Lambda_1Z_1 + d^{-\alpha_1}\sum_{l=2}^p Z_l'\Lambda_lZ_l + d^{1-\alpha_1}d^{-1}Z_{p+1}'\Lambda_{p+1}Z_{p+1}.$$

By the assumption (b), the first term on the right-hand side converges to $Z_1'C_1Z_1$, where $C_1$ is the $k_1 \times k_1$ diagonal matrix such that $C_1 = \operatorname{diag}\{c_j; j \in J_1\}$, and the other terms tend to a zero matrix. Thus, we get

$$nd^{-\alpha_1}S_D \Longrightarrow Z_1'C_1Z_1 \quad \text{as } d \to \infty.$$

Note that the nonzero eigenvalues of $Z_1'C_1Z_1$ are the same as the nonzero eigenvalues of $C_1^{1/2}Z_1Z_1'C_1^{1/2}$, which is a $k_1 \times k_1$ random matrix with full rank almost surely. Since $\varphi_i(A)$ is a continuous function of the entries of $A$ (see, e.g., Kato [13]), we have for $i \in J_1$,

$$\varphi_i(nd^{-\alpha_1}S_D) \Longrightarrow \varphi_i(Z_1'C_1Z_1) = \varphi_i(C_1^{1/2}Z_1Z_1'C_1^{1/2}) \quad \text{as } d \to \infty.$$

Thus, we conclude that for the sample eigenvalues in the group $J_1$, $\hat\lambda_i/d^{\alpha_1} = \varphi_i(d^{-\alpha_1}S_D)$ converges in distribution to $\varphi_i(n^{-1}C_1^{1/2}Z_1Z_1'C_1^{1/2})$ for $i \in J_1$.

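Let us focus on eigenvalues whose indices are in the groups $J_2,\ldots,J_p$. Suppose we have $\hat\lambda_i = O_p(d^{\alpha_j})$ for all $i \in J_j$, for $j = 1,\ldots,l-1$. Pick any $i \in J_l$. We will provide upper and lower bounds on $\hat\lambda_i$ by Weyl's inequality (Lemma 3). Dividing both sides of (5.3) by $d^{\alpha_l}$, we get

$$nd^{-\alpha_l}S_D = d^{-\alpha_l}\sum_{j=1}^{l-1}Z_j'\Lambda_jZ_j + d^{-\alpha_l}\sum_{j=l}^{p+1}Z_j'\Lambda_jZ_j$$

and apply Weyl's inequality for the upper bound,

$$\varphi_i(nd^{-\alpha_l}S_D) \le \varphi_{1+\sum_{j=1}^{l-1}k_j}\left(d^{-\alpha_l}\sum_{j=1}^{l-1}Z_j'\Lambda_jZ_j\right) + \varphi_{i-\sum_{j=1}^{l-1}k_j}\left(d^{-\alpha_l}\sum_{j=l}^{p+1}Z_j'\Lambda_jZ_j\right)$$

$$= \varphi_{i-\sum_{j=1}^{l-1}k_j}\left(d^{-\alpha_l}\sum_{j=l}^{p+1}Z_j'\Lambda_jZ_j\right). \tag{5.5}$$

Note that the first term vanishes since the rank of $d^{-\alpha_l}\sum_{j=1}^{l-1}Z_j'\Lambda_jZ_j$ is at most $\sum_{j=1}^{l-1}k_j$. Also note that the matrix in the upper bound (5.5) converges to a simple form

$$d^{-\alpha_l}\sum_{j=l}^{p+1}Z_j'\Lambda_jZ_j = d^{-\alpha_l}Z_l'\Lambda_lZ_l + d^{-\alpha_l}\sum_{j=l+1}^{p+1}Z_j'\Lambda_jZ_j \Longrightarrow Z_l'C_lZ_l \quad \text{as } d \to \infty,$$

where $C_l$ is the $k_l \times k_l$ diagonal matrix such that $C_l = \operatorname{diag}\{c_j; j \in J_l\}$.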

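In order to have a lower bound of $\hat\lambda_i$, Weyl's inequality is applied to the expression

$$d^{-\alpha_l}\sum_{j=1}^{l}Z_j'\Lambda_jZ_j + d^{-\alpha_l}\sum_{j=l+1}^{p+1}Z_j'\Lambda_jZ_j = nd^{-\alpha_l}S_D,$$

so that

$$\varphi_i\left(d^{-\alpha_l}\sum_{j=1}^{l}Z_j'\Lambda_jZ_j\right) + \varphi_n\left(d^{-\alpha_l}\sum_{j=l+1}^{p+1}Z_j'\Lambda_jZ_j\right) \le \varphi_i(nd^{-\alpha_l}S_D). \tag{5.6}$$

It turns out that the first term of the left-hand side is not easy to manage, so we again use Weyl's inequality to get

$$\varphi_{\sum_{j=1}^{l}k_j}\left(d^{-\alpha_l}\sum_{j=1}^{l-1}Z_j'\Lambda_jZ_j\right) \le \varphi_i\left(d^{-\alpha_l}\sum_{j=1}^{l}Z_j'\Lambda_jZ_j\right) + \varphi_{1-i+\sum_{j=1}^{l}k_j}\left(-d^{-\alpha_l}Z_l'\Lambda_lZ_l\right), \tag{5.7}$$

where the left-hand side is 0 since the rank of the matrix inside is at most $\sum_{j=1}^{l-1}k_j$. Note that since $d^{-\alpha_l}Z_l'\Lambda_lZ_l$ and $d^{-\alpha_l}\Lambda_l^{1/2}Z_lZ_l'\Lambda_l^{1/2}$ share nonzero eigenvalues, we get

$$\begin{aligned} \varphi_{1-i+\sum_{j=1}^{l}k_j}\left(-d^{-\alpha_l}Z_l'\Lambda_lZ_l\right) &= \varphi_{1-i+\sum_{j=1}^{l}k_j}\left(-d^{-\alpha_l}\Lambda_l^{1/2}Z_lZ_l'\Lambda_l^{1/2}\right) \\ &= \varphi_{k_l-i+1+\sum_{j=1}^{l-1}k_j}\left(-d^{-\alpha_l}\Lambda_l^{1/2}Z_lZ_l'\Lambda_l^{1/2}\right) \\ &= -\varphi_{i-\sum_{j=1}^{l-1}k_j}\left(d^{-\alpha_l}\Lambda_l^{1/2}Z_lZ_l'\Lambda_l^{1/2}\right) \\ &= -\varphi_{i-\sum_{j=1}^{l-1}k_j}\left(d^{-\alpha_l}Z_l'\Lambda_lZ_l\right). \end{aligned} \tag{5.8}$$

Here, we use the fact that for any $m \times m$ real symmetric matrix $A$, $\varphi_i(A) = -\varphi_{m-i+1}(-A)$ for all $i = 1,\ldots,m$.

Combining (5.6)-(5.8) gives the lower bound

$$\varphi_{i-\sum_{j=1}^{l-1}k_j}\left(d^{-\alpha_l}Z_l'\Lambda_lZ_l\right) + \varphi_n\left(d^{-\alpha_l}\sum_{j=l+1}^{p+1}Z_j'\Lambda_jZ_j\right) \le \varphi_i(nd^{-\alpha_l}S_D). \tag{5.9}$$

Note that the matrix inside of the first term of the lower bound (5.9) converges to $Z_l'C_lZ_l$ in distribution. The second term converges to 0 since the matrix inside converges to a zero matrix.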

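The difference between the upper and lower bounds of $\varphi_i(nd^{-\alpha_l}S_D)$ converges to 0 since

$$\varphi_{i-\sum_{j=1}^{l-1}k_j}\left(d^{-\alpha_l}\sum_{j=l}^{p+1}Z_j'\Lambda_jZ_j\right) - \varphi_{i-\sum_{j=1}^{l-1}k_j}\left(d^{-\alpha_l}Z_l'\Lambda_lZ_l\right) \xrightarrow{p} 0,$$

as $d \to \infty$. This is because $\varphi$ is a continuous function and the difference between the two matrices converges to the zero matrix. Therefore, $\varphi_i(nd^{-\alpha_l}S_D)$ converges to the upper or lower bound as $d \to \infty$.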




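Now since both the upper and lower bounds of $\varphi_i(nd^{-\alpha_l}S_D)$ converge in distribution to the same quantity, we have

$$\varphi_i(nd^{-\alpha_l}S_D) \Longrightarrow \varphi_{i-\sum_{j=1}^{l-1}k_j}(Z_l'C_lZ_l) \quad \text{as } d \to \infty. \tag{5.10}$$

Thus, by induction, the scaled $i$th sample eigenvalue $\hat\lambda_i/d^{\alpha_l}$ converges in distribution to $\varphi_{i-\sum_{j=1}^{l-1}k_j}(n^{-1}C_l^{1/2}Z_lZ_l'C_l^{1/2})$ for $i \in J_l$, $l = 1,\ldots,p$, as desired.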

Now, let us focus on the rest of the sample eigenvalues $\hat\lambda_i$, $i = \kappa+1,\ldots,n$. For any $i$, again by Weyl's upper bound inequality, we get

$$\varphi_i(nd^{-1}S_D) \le \varphi_{i-\kappa}\left(d^{-1}Z_{p+1}'\Lambda_{p+1}Z_{p+1}\right) + \varphi_{\kappa+1}\left(d^{-1}\sum_{j=1}^{p}Z_j'\Lambda_jZ_j\right) = \varphi_{i-\kappa}\left(d^{-1}Z_{p+1}'\Lambda_{p+1}Z_{p+1}\right),$$

where the second term on the right-hand side vanishes since the matrix inside is of rank at most $\kappa$. Also for the lower bound, we have

$$\varphi_i(nd^{-1}S_D) \ge \varphi_i\left(d^{-1}Z_{p+1}'\Lambda_{p+1}Z_{p+1}\right) + \varphi_n\left(d^{-1}\sum_{j=1}^{p}Z_j'\Lambda_jZ_j\right) = \varphi_i\left(d^{-1}Z_{p+1}'\Lambda_{p+1}Z_{p+1}\right),$$

where the second term vanishes since $\kappa < n$. Thus, we have complete bounds for $\varphi_i(nd^{-1}S_D)$ such that

$$\varphi_i\left(d^{-1}Z_{p+1}'\Lambda_{p+1}Z_{p+1}\right) \le \varphi_i(nd^{-1}S_D) \le \varphi_{i-\kappa}\left(d^{-1}Z_{p+1}'\Lambda_{p+1}Z_{p+1}\right)$$

for all $i = \kappa+1,\ldots,n$. However, by (5.4), the matrix in both bounds converges to $nK \cdot I_n$ in probability. Thus, the lower and upper bounds of $\varphi_i(d^{-1}S_D)$ converge to $K$ in probability for $i = \kappa+1,\ldots,n$, which completes the proof.

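5.2.2. Proof of Theorem 2. We begin by defining a standardized version of the sample covariance matrix, not to be confused with the dual $S_D$, as

$$\tilde S = \Lambda^{-1/2}U'SU\Lambda^{-1/2} = \Lambda^{-1/2}U'(\hat U\hat\Lambda\hat U')U\Lambda^{-1/2} = \Lambda^{-1/2}P\hat\Lambda P'\Lambda^{-1/2}, \tag{5.11}$$

where $P = U'\hat U = \{u_i'\hat u_j\}_{ij} \equiv \{p_{ij}\}_{ij}$. Note that elements of $P$ are inner products between population eigenvectors and sample eigenvectors. Since $\tilde S$ is standardized, we have by $S = n^{-1}XX'$ and $X = U\Lambda^{1/2}Z$,

$$\tilde S = n^{-1}ZZ'. \tag{5.12}$$

Note that the angle between two directions can be formulated as an inner product of the two direction vectors. Thus, we will investigate the behavior of the inner product matrix $P$ as $d \to \infty$, by showing that

$$\sum_{j\in J_l}p_{ji}^2 \xrightarrow{p} 1 \quad \text{as } d \to \infty \tag{5.13}$$

for all $i \in J_l$, $l = 1,\ldots,p$, and

$$p_{ii}^2 \xrightarrow{p} 0 \quad \text{as } d \to \infty \tag{5.14}$$

for all $i = \kappa+1,\ldots,n$.

Suppose for now we have the result of (5.13) and (5.14). Then for any $i \in J_l$, $l = 1,\ldots,p$,

$$\begin{aligned} \operatorname{Angle}(\hat u_i, \operatorname{span}\{u_j : j \in J_l\}) &= \arccos\left(\frac{\hat u_i'[\operatorname{Proj}_{\operatorname{span}\{u_j : j\in J_l\}}\hat u_i]}{\|\hat u_i\|_2 \cdot \|[\operatorname{Proj}_{\operatorname{span}\{u_j : j\in J_l\}}\hat u_i]\|_2}\right) \\ &= \arccos\left(\frac{\hat u_i'\left(\sum_{j\in J_l}(u_j'\hat u_i)u_j\right)}{\|\hat u_i\|_2 \cdot \left\|\sum_{j\in J_l}(u_j'\hat u_i)u_j\right\|_2}\right) \\ &= \arccos\left(\frac{\sum_{j\in J_l}(u_j'\hat u_i)^2}{1 \cdot \left(\sum_{j\in J_l}(u_j'\hat u_i)^2\right)^{1/2}}\right) \\ &= \arccos\left(\left(\sum_{j\in J_l}p_{ji}^2\right)^{1/2}\right) \xrightarrow{p} 0 \quad \text{as } d \to \infty, \end{aligned}$$

by (5.13) and for $i = \kappa+1,\ldots,n$,

$$\operatorname{Angle}(\hat u_i, u_i) = \arccos(|u_i'\hat u_i|) = \arccos(|p_{ii}|) \xrightarrow{p} \frac{\pi}{2} \quad \text{as } d \to \infty,$$

by (5.14), as desired.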
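The angle formula above reduces to $\arccos\bigl((\sum_{j\in J_l}p_{ji}^2)^{1/2}\bigr)$, which is straightforward to compute. The sketch below is a minimal illustration; the function name `angle_to_subspace` is hypothetical, and the basis passed in is assumed to have orthonormal columns.

```python
import numpy as np

def angle_to_subspace(u_hat, basis):
    """Angle (radians) between u_hat and the span of the columns of `basis`.

    `basis` is assumed to be a d x k matrix with orthonormal columns,
    so the inner products p_j = u_j' u_hat give the projection coefficients.
    """
    p = basis.T @ (u_hat / np.linalg.norm(u_hat))
    cos_angle = np.sqrt(np.sum(p ** 2))
    return np.arccos(np.clip(cos_angle, 0.0, 1.0))

basis = np.eye(3)[:, :2]                    # span{e_1, e_2} in R^3
theta_inside = angle_to_subspace(np.array([1.0, 0.0, 0.0]), basis)
theta_orth = angle_to_subspace(np.array([0.0, 0.0, 1.0]), basis)
theta_mixed = angle_to_subspace(np.array([1.0, 1.0, 1.0]), basis)

print(theta_inside, theta_orth, theta_mixed)
```

An angle of 0 corresponds to subspace consistency in the limit and an angle of $\pi/2$ to strong inconsistency.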

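Therefore, it is enough to show (5.13) and (5.14). We begin with taking the $j$th diagonal entry of $\tilde S$, $\tilde s_{jj}$, from (5.11) and (5.12),

$$\tilde s_{jj} = \lambda_j^{-1}\sum_{i=1}^n \hat\lambda_ip_{ji}^2 = n^{-1}z_jz_j',$$

where $z_j$ denotes the $j$th row vector of $Z$. Since

$$\lambda_j^{-1}\hat\lambda_ip_{ji}^2 \le n^{-1}z_jz_j', \tag{5.15}$$

we have at most

$$p_{ji}^2 = O_p\left(\frac{\lambda_j}{\hat\lambda_i}\right)$$

for all $i = 1,\ldots,n$, $j = 1,\ldots,d$. Note that by Lemma 1, we have for $i \in J_{l_1}$, $j \in J_{l_2}$ where $1 \le l_1 < l_2 \le p+1$,

$$p_{ji}^2 = O_p\left(\frac{\lambda_j}{\hat\lambda_i}\right) = \begin{cases} O_p(d^{\alpha_{l_2}-\alpha_{l_1}}), & \text{if } l_2 \le p, \\ O_p(d^{1-\alpha_{l_1}}), & \text{if } l_2 = p+1, \end{cases} \tag{5.16}$$

so that $p_{ji}^2 \xrightarrow{p} 0$ as $d \to \infty$ in both cases.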
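The identities (5.11)-(5.12) and the diagonal-entry formula behind (5.15) can be confirmed on a small random instance. The dimensions and eigenvalues below are arbitrary illustrative choices; the check sums over all $d$ sample eigenvalues, of which only $n$ are nonzero.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 12, 4
lam = np.linspace(6.0, 0.5, d)                      # population eigenvalues
U, _ = np.linalg.qr(rng.standard_normal((d, d)))    # population eigenvectors
Z = rng.standard_normal((d, n))
X = U @ (np.sqrt(lam)[:, None] * Z)                 # X = U Lambda^{1/2} Z

S = X @ X.T / n
lam_hat, U_hat = np.linalg.eigh(S)                  # S = U_hat diag(lam_hat) U_hat'
P = U.T @ U_hat                                     # inner product matrix

L_inv_half = np.diag(lam ** -0.5)
S_tilde = L_inv_half @ U.T @ S @ U @ L_inv_half
S_tilde_via_P = L_inv_half @ P @ np.diag(lam_hat) @ P.T @ L_inv_half

# j-th diagonal entry: lambda_j^{-1} sum_i lam_hat_i p_ji^2 = n^{-1} z_j z_j'.
diag_formula = (P ** 2 @ lam_hat) / lam
diag_direct = np.sum(Z ** 2, axis=1) / n

print(np.max(np.abs(S_tilde - Z @ Z.T / n)), np.max(np.abs(diag_formula - diag_direct)))
```

Because every summand in the diagonal-entry formula is nonnegative, each single term is bounded by the whole sum, which is exactly the inequality (5.15).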

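Note that the inner product matrix $P$ is also a unitary matrix. The norm of the $i$th column vector of $P$ must be 1 for all $d$, i.e., $\sum_{j=1}^d p_{ji}^2 = 1$. Thus, (5.13) is equivalent to $\sum_{j\in\{1,\ldots,d\}\setminus J_l}p_{ji}^2 \xrightarrow{p} 0$ as $d \to \infty$.

Now for any $i \in J_1$,

$$\sum_{j\in\{1,\ldots,d\}\setminus J_1}p_{ji}^2 = \sum_{j\in J_2\cup\cdots\cup J_p}p_{ji}^2 + \sum_{j\in J_{p+1}}p_{ji}^2.$$

Since the first term on the right-hand side is a finite sum of quantities converging to 0, it converges to 0 almost surely as $d$ tends to infinity. By (5.15), we have an upper bound for the second term,

$$\sum_{j\in J_{p+1}}p_{ji}^2 = \sum_{j\in J_{p+1}}\lambda_j^{-1}p_{ji}^2\hat\lambda_i\frac{\lambda_j}{\hat\lambda_i} \le \frac{\sum_{j\in J_{p+1}}n^{-1}z_jz_j'\lambda_j}{d}\cdot\frac{d}{\hat\lambda_i} = \frac{\sum_{k=1}^n\sum_{j=\kappa+1}^d z_{j,k}^2\lambda_j}{nd}\cdot\frac{d^{\alpha_1}}{\hat\lambda_i}\cdot\frac{d}{d^{\alpha_1}},$$

where the $z_{j,k}$'s are the entries of a row random vector $z_j$. Note that by applying Theorem 1 with $\Sigma_d = \operatorname{diag}\{\lambda_{\kappa+1},\ldots,\lambda_d\}$, we have $\sum_{j=\kappa+1}^d z_{j,k}^2\lambda_j/d \xrightarrow{p} 1$ as $d \to \infty$. Also by Lemma 1, the upper bound converges to 0 in probability. Thus, we get

$$\sum_{j\in\{1,\ldots,d\}\setminus J_1}p_{ji}^2 \xrightarrow{p} 0 \quad \text{as } d \to \infty,$$

which is equivalent to

$$\sum_{j\in J_1}p_{ji}^2 \xrightarrow{p} 1 \quad \text{as } d \to \infty. \tag{5.17}$$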

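Let us focus on the groups $J_2,\ldots,J_p$. For any $l = 2,\ldots,p$, suppose we have $\sum_{j\in J_m}p_{ji}^2 \xrightarrow{p} 1$ as $d \to \infty$ for all $i \in J_m$, $m = 1,\ldots,l-1$. Note that this implies that for any $j \in J_m$, $m = 1,\ldots,l-1$,

$$\sum_{i\in\{1,\ldots,d\}\setminus J_m}p_{ji}^2 \xrightarrow{p} 0 \quad \text{as } d \to \infty, \tag{5.18}$$

since

$$\sum_{j\in J_m}\sum_{i\in\{1,\ldots,d\}\setminus J_m}p_{ji}^2 = \sum_{j\in J_m}\sum_{i=1}^d p_{ji}^2 - \sum_{j\in J_m}\sum_{i\in J_m}p_{ji}^2 \xrightarrow{p} \sum_{j\in J_m}1 - \sum_{i\in J_m}1 = 0$$

as $d \to \infty$.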

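Now, pick $i \in J_l$. We have

$$\sum_{j\in\{1,\ldots,d\}\setminus J_l}p_{ji}^2 = \sum_{j\in J_1\cup\cdots\cup J_{l-1}}p_{ji}^2 + \sum_{j\in J_{l+1}\cup\cdots\cup J_p}p_{ji}^2 + \sum_{j\in J_{p+1}}p_{ji}^2.$$

Note that the first term is bounded as

$$\sum_{j\in J_1\cup\cdots\cup J_{l-1}}p_{ji}^2 \le \sum_{i\in J_l}\sum_{j\in J_1\cup\cdots\cup J_{l-1}}p_{ji}^2 \le \sum_{m=1}^{l-1}\sum_{j\in J_m}\sum_{i\in\{1,\ldots,d\}\setminus J_m}p_{ji}^2 \xrightarrow{p} 0$$

by (5.18). The second term also converges to 0 by (5.16). The last term is also bounded as

$$\sum_{j\in J_{p+1}}p_{ji}^2 = \sum_{j\in J_{p+1}}\lambda_j^{-1}\hat\lambda_ip_{ji}^2\frac{\lambda_j}{\hat\lambda_i} \le \frac{\sum_{j\in J_{p+1}}n^{-1}z_jz_j'\lambda_j/d}{\hat\lambda_i/d},$$

so that it also converges to 0 in probability. Thus, we have $\sum_{j\in\{1,\ldots,d\}\setminus J_l}p_{ji}^2 \xrightarrow{p} 0$ as $d \to \infty$, which implies that

$$\sum_{j\in J_l}p_{ji}^2 \xrightarrow{p} 1 \quad \text{as } d \to \infty.$$

Thus, by induction, (5.13) is proved.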



For $i = \kappa+1,\ldots,n$, we have $\lambda_i^{-1}\hat\lambda_ip_{ii}^2 \le n^{-1}z_iz_i'$, and so

$$p_{ii}^2 \le \hat\lambda_i^{-1}\lambda_in^{-1}z_iz_i',$$

which implies (5.14) by the assumption (c) and Lemma 1, and the proof is completed.

5.3. Proof of Corollary 1. The proof follows the same lines as the proof 
of Theorem 2, with convergence in probability replaced by almost sure con- 
vergence. 



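5.4. Proof of Corollary 2. From the proof of Theorem 2, write the inner product matrix $P$ of (5.11) as a block matrix such that

$$P = \begin{pmatrix} P_{11} & \cdots & P_{1p} & P_{1,p+1} \\ \vdots & \ddots & \vdots & \vdots \\ P_{p1} & \cdots & P_{pp} & P_{p,p+1} \\ P_{p+1,1} & \cdots & P_{p+1,p} & P_{p+1,p+1} \end{pmatrix},$$

where each $P_{ij}$ is a $k_i \times k_j$ random matrix. In the proof of Theorem 2, we have shown that $P_{ii}$, $i = 1,\ldots,p$, tends to be a unitary matrix and $P_{ij}$, $i \ne j$, tends to be a zero matrix as $d \to \infty$. Likewise, $\Lambda$ and $\hat\Lambda$ can be blocked similarly as $\Lambda = \operatorname{diag}\{\Lambda_i : i = 1,\ldots,p+1\}$ and $\hat\Lambda = \operatorname{diag}\{\hat\Lambda_i : i = 1,\ldots,p+1\}$.

Now, pick $l \in \{1,\ldots,p\}$. The $l$th block diagonal of $\tilde S$, $\tilde S_{ll}$, is expressed as $\tilde S_{ll} = \sum_{j=1}^{p+1}\Lambda_l^{-1/2}P_{lj}\hat\Lambda_jP_{lj}'\Lambda_l^{-1/2}$. Since $P_{ij} \to 0$, $i \ne j$, we get

$$\|\tilde S_{ll} - \Lambda_l^{-1/2}P_{ll}\hat\Lambda_lP_{ll}'\Lambda_l^{-1/2}\|_F \xrightarrow{p} 0$$

as $d \to \infty$, where $\|\cdot\|_F$ is the Frobenius norm of matrices defined by $\|A\|_F = (\sum_{i,j}A_{ij}^2)^{1/2}$.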

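Note that by (5.12), $\tilde S_{ll}$ can be replaced by $n^{-1}Z_lZ_l'$. We also have $d^{-\alpha_l}\Lambda_l \to C_l$ by the assumption (b) and $d^{-\alpha_l}\hat\Lambda_l \xrightarrow{p} \operatorname{diag}\{\varphi(n^{-1}C_l^{1/2}Z_lZ_l'C_l^{1/2})\}$ by (5.10). Thus, we get

$$\|n^{-1}Z_lZ_l' - C_l^{-1/2}P_{ll}\operatorname{diag}\{\varphi(n^{-1}C_l^{1/2}Z_lZ_l'C_l^{1/2})\}P_{ll}'C_l^{-1/2}\|_F \xrightarrow{p} 0$$

as $d \to \infty$.

Also note that since $n^{-1}Z_lZ_l' \to I_{k_l}$ almost surely as $n \to \infty$, we get $n^{-1}C_l^{1/2}Z_lZ_l'C_l^{1/2} \to C_l$ and $\operatorname{diag}\{\varphi(n^{-1}C_l^{1/2}Z_lZ_l'C_l^{1/2})\} \to C_l$ almost surely as $n \to \infty$. Using the fact that the Frobenius norm is unitarily invariant and $\|AB\|_F \le \|A\|_F\|B\|_F$ for any square matrices $A$ and $B$, we get

$$\begin{aligned} \|P_{ll}'C_lP_{ll} - C_l\|_F &\le \|P_{ll}'C_lP_{ll} - \operatorname{diag}\{\varphi(n^{-1}C_l^{1/2}Z_lZ_l'C_l^{1/2})\}\|_F + o_p(1) \\ &= \|C_l - P_{ll}\operatorname{diag}\{\varphi(n^{-1}C_l^{1/2}Z_lZ_l'C_l^{1/2})\}P_{ll}'\|_F + o_p(1) \\ &\le \|n^{-1}C_l^{1/2}Z_lZ_l'C_l^{1/2} - P_{ll}\operatorname{diag}\{\varphi(n^{-1}C_l^{1/2}Z_lZ_l'C_l^{1/2})\}P_{ll}'\|_F + o_p(1) \\ &\le \|C_l^{1/2}\|_F^2\,\|n^{-1}Z_lZ_l' - C_l^{-1/2}P_{ll}\operatorname{diag}\{\varphi(n^{-1}C_l^{1/2}Z_lZ_l'C_l^{1/2})\}P_{ll}'C_l^{-1/2}\|_F + o_p(1) \\ &\xrightarrow{p} 0 \quad \text{as } d, n \to \infty. \end{aligned} \tag{5.19}$$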

Note that in order to have (5.19), $P_{ll}$ must converge to $\operatorname{diag}\{\pm1,\pm1,\ldots,\pm1\}$ since the diagonal entries of $C_l$ are distinct and a spectral decomposition is unique up to sign changes. Let $l = 1$ for simplicity. Now for any $m = 2,\ldots,k_1$, $p_{m1}^2 \xrightarrow{p} 0$ since

$$\|P_{11}'C_1P_{11} - C_1\|_F^2 \ge \sum_{j=1}^{k_1}(c_1 - c_j)^2p_{j1}^2 \ge (c_1 - c_m)^2p_{m1}^2.$$

This leads to $p_{11}^2 \xrightarrow{p} 1$ as $d, n \to \infty$. By induction, $p_{ii}^2 \xrightarrow{p} 1$ for all $i \in J_l$, $l = 1,\ldots,p$. Therefore, $\operatorname{Angle}(\hat u_i, u_i) = \arccos(|p_{ii}|) \xrightarrow{p} 0$ as $d, n \to \infty$.

If the assumptions of Corollary 1 also hold, then every convergence in the proof is replaced by almost sure convergence, which completes the proof.
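The lower bound on $\|P_{11}'C_1P_{11} - C_1\|_F^2$ used in this proof can be checked numerically when $P_{11}$ is exactly orthogonal (in the proof it is only asymptotically so). For orthogonal $P$, $\|P'CP - C\|_F = \|CP - PC\|_F$, and the first column of $CP - PC$ has entries $(c_j - c_1)p_{j1}$. The values of `c` and the random orthogonal matrix below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
c = np.array([5.0, 4.0, 3.0, 2.0, 1.0])             # distinct diagonal entries of C
C = np.diag(c)
P, _ = np.linalg.qr(rng.standard_normal((5, 5)))    # exactly orthogonal stand-in for P_11

lhs = np.linalg.norm(P.T @ C @ P - C, "fro") ** 2

# Terms (c_1 - c_j)^2 p_{j1}^2 built from the first column of P.
col_terms = (c[0] - c) ** 2 * P[:, 0] ** 2

print(lhs, col_terms.sum(), col_terms.max())
```

Because the $c_j$ are distinct, a small left-hand side forces every $p_{m1}$ with $m \ne 1$ to be small, which is how the proof isolates $p_{11}^2 \to 1$.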

5.5. Proof of Corollary 3. With the Gaussian assumption, noticing that $C_l^{1/2}Z_lZ_l'C_l^{1/2} \sim W_{k_l}(n,C_l)$ gives the first result. When $k_l = 1$, the assumption (b) and the fact that $C_l^{1/2}Z_lZ_l'C_l^{1/2} \sim c_i\chi_n^2$ imply that

$$\frac{\hat\lambda_i}{\lambda_i} = \frac{\hat\lambda_i}{c_id^{\alpha_l}}\cdot\frac{c_id^{\alpha_l}}{\lambda_i} \Longrightarrow \frac{\chi_n^2}{n} \quad \text{as } d \to \infty.$$